Abstract



This paper provides a guide to calculating statistical power for the complex multilevel designs 
that are used in most field studies in education research. For multilevel evaluation studies in the 
field of education, it is important to account for the impact of clustering on the standard errors 
of estimates of treatment effects. Using ideas from survey research, the paper explains how 
sample design induces random variation in the quantities observed in a randomized experiment, 
and how this random variation relates to statistical power. The manner in which statistical 
power depends upon the values of intraclass correlations, sample sizes at the various levels, the 
standardized average treatment effect (effect size), the multiple correlation between covariates 
and the outcome at different levels, and the heterogeneity of treatment effects across sampling 
units is illustrated. Both hierarchical and randomized block designs are considered. The paper 
demonstrates that statistical power in complex designs involving clustered sampling can be 
computed simply from standard power tables using the idea of operational effect sizes: effect 
sizes multiplied by a design effect that depends on features of the complex experimental design. 
These concepts are applied to provide methods for computing power for each of the research 
designs most frequently used in education research. 
Chapter 1: Introduction



Experimental evaluation seeks to make possible valid inferences about the effects of a treatment, 
or intervention, in question. However, a research study can make an invalid inference by 
concluding that a treatment has an effect when it does not: This mistaken conclusion is called a 
Type I error in statistical decision theory. Statistical significance testing is designed to control 
the chance of making Type I errors. A second way that a research study can make an invalid 
inference is by failing to detect that a treatment has an effect when the true treatment effect is 
nonzero. This is called a Type II error in statistical decision theory. Statistical power is defined 
as the probability that a research study avoids a Type II error by correctly rejecting a null 
hypothesis of zero treatment effect. Low statistical power increases the probability of obtaining a 
Type II error and is a major threat to the statistical conclusion validity of educational research 
studies (Shadish, Cook, and Campbell 2002). 

Statistical power analysis is a method of determining the probability that a proposed research 
design will detect the anticipated effects of a treatment. It helps the researcher determine whether 
a study design should be modified so that it will have adequate power for detecting effects. The 
purpose of this paper is to provide an introduction to the computation of statistical power for 
education field studies and to discuss research design parameters that directly impact statistical 
power. Most field studies in education involve complex designs, typically involving clustered 
sampling of students within classes or schools and assignment to treatments by groups (such as 
classrooms or schools). As will be expanded upon later in the paper, clustering directly impacts 
statistical power because the relationships among units within clusters usually imply a greater 
similarity of outcomes for units within a cluster than for units coming from different clusters. For 
example, if there were two third-grade classrooms in a school, one would expect greater 
similarities among students within each of the classrooms than among students from different 
classrooms. This concept, represented statistically by the intraclass correlation coefficient (ICC), 
plays a crucial role in the power analysis of designs with clustering (Donner and Klar 2000; 
Murray 1998; Shadish, Cook, and Campbell 2002). 

This paper is intended for education researchers who are familiar with basic statistics concepts 
but do not consider themselves to be experts in statistics. Often, education researchers have 
training in statistical power analysis that is limited to studies that have relatively simple designs 
(e.g., one level of sampling and individual randomization). This paper provides a guide for 
calculating statistical power for more complicated multilevel designs that are used in most field 
studies in education. 

This paper begins with an explanation of sampling theory applied to education research and 
shows that sample design in education research can be understood in terms of ideas from survey 
research. This conceptual transferability is possible because, similar to survey data and designs, 
the data considered in many — if not most — educational applications are not obtained from a 
simple random sample, but rather from a more complex sample design. As a result, the analyses 
of the research and assumptions about the research designs must be addressed accordingly. 
Understanding how sample design induces random variation in the quantities observed in a 
randomized experiment is a necessary precursor to understanding statistical power. 
Next, the topics of hierarchical designs and randomized-block designs (the two major classes of 
experimental designs that are predominantly used in education research) are explicated to 
explore the concept of multiple levels of analysis and how to address them in education research. 
The concept of statistical power is then introduced. This paper contains demonstrations that 
statistical power in complex designs involving clustered sampling can be computed simply using 
the idea of operational effect sizes — effect sizes multiplied by a design effect that depends on 
features of the complex experimental design. Finally, these concepts are applied to provide 
methods for computing power for each of the 10 research designs most frequently used in 
education research. 

This paper has five appendices. Appendix A provides formulas for computing design effects in 
multilevel hierarchical designs. Appendix B provides formulas for computing design effects in 
multilevel randomized-block designs. Appendix C details methods for computing power in three- 
level randomized-block designs. Appendix D describes the multilevel models on which power 
computations are based. Finally, Appendix E provides a glossary of technical terms and the page 
numbers on which these terms are used. This paper also contains two tables that can be used for 
determining the power of the test for treatment effects in commonly used education research 
designs. 

Readers should remember that assumptions and values used in the demonstrations throughout the 
report are intended for illustration only and are not intended to suggest reference values to use 
when planning other research studies. 
Chapter 2: Sampling and Sample Design



Statistical analyses are tools for drawing inferences in the presence of random fluctuations or 
uncertainty (“randomness” and “uncertainty” are used synonymously in the field of research, but 
for the sake of simplicity, “randomness” will be used throughout this paper). The conceptual 
models on which statistical analysis methods are based depend on the idea that the randomness 
comes from sampling. Hence, an understanding of how sample design induces random variation 
in the quantities observed in a randomized experiment is a prerequisite for understanding 
statistical power. This crucial point is often misunderstood. 

Sample Designs, Sample Surveys, and Hierarchical Structures 

The sampling design is the process by which individuals from the population are selected for the 
sample. Most statistical and education research — and most statistical theory — begins with the 
assumption that the data being considered come from a simple random sample. Unfortunately, 
the data considered in many, if not most, educational applications are not obtained from a simple 
random sample but rather from a more complex sample design. This section reviews two crucial 
concepts from sample design — randomized-block design and hierarchical design — and applies 
them to designs for education research. 

A population of potential observations can often be identified as having some structure that 
makes it non-uniform in obvious ways. In survey research, this structure is often geographic 
(e.g., states; smaller geographic divisions, such as census tracts; or still smaller geographic 
divisions, such as neighborhood blocks), but it could also be demographic (e.g., groups defined 
by age, gender, or race). In education studies of children in schools, the relevant population 
structure is often defined by the hierarchical organization of the educational system (i.e., children 
are nested within classrooms, classrooms are nested within schools, and schools are nested 
within school districts). 

The fact that populations are structured (and the same population may be structured in several 
different ways) does not necessarily mean that simple random sampling is impossible, nor does it 
affect the properties of statistics computed from simple random samples of these populations. 
Population structure, in fact, can provide opportunities to collect samples in ways that have 
practical advantages. For example, a simple random sample of fourth-grade students in a state 
would begin with a list of all fourth-graders in the state and then use a random device to pick the 
required number of students from the list. This sampling would typically result in a sample with 
very few students in any one classroom or school and a relatively large number of schools. Such 
a sample would be difficult and expensive to collect because it would involve obtaining data 
from many different classrooms and schools. Additionally, it may be impossible to obtain a list 
of all fourth-graders in the state but easy enough to obtain a list of all schools in the state. In this 
case, one may obtain a sample of fourth-grade students in two stages. In the first stage, a simple 
random sample of schools is selected. In the second stage, a sample of fourth-graders is collected 
from within each school selected in the first stage of sampling. Samples obtained in this manner 
are called hierarchical or clustered samples. Thus, clustered sampling designs provide a 
mechanism for obtaining a random sample from the desired population when a simple random 
sampling design is either impossible or not cost effective. 
Stratification and Clustered Sampling

Sample survey designs almost always use one of two methods to simplify data collection. Which 
of the two methods is used has important implications for statistical analysis. One method is 
stratification, which is the division of the population into subsets (i.e., strata) that do not overlap 
(such as gender or achievement level) (Shadish, Cook, and Campbell 2002). The sample is then 
drawn so that some individuals are selected within each stratum. The second method used to 
simplify data collection is clustered sampling. Clustered sampling also begins with the division 
of the population into non-overlapping subsets (now called clusters), but in the case of clustered 
sampling, individuals are not strategically or intentionally sampled from each of the clusters; 
instead, only individuals from a sample of the clusters are selected into the sample. 

Operationally, a cluster sample is drawn in two stages, which is why clustered sampling is often 
called two-stage or two-level sampling. In the first stage, a simple random sample of clusters, 
such as schools, is selected, and then a simple random sample of individuals, such as students 
within these schools, is drawn within each of the clusters that were selected at the first stage. The 
distinction between stratified sampling and clustered sampling lies in whether some individuals 
are included in the sample from every one of the subdivisions of the population (such as gender 
and achievement level). In other words, if every one of the clusters in the population is 
intentionally included in the sample, the cluster sample becomes a stratified sample. Therefore, 
stratified samples include individuals from each of the subdivisions, and clustered samples do 
not. 
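
The distinction between the two sampling methods can be sketched in a few lines of code. The example below is a minimal illustration only: the population of 20 schools with 50 students each, the function names, and the sample sizes are all hypothetical and are not taken from any study discussed in this paper.

```python
import random


def make_population(num_schools=20, students_per_school=50):
    """Hypothetical population: school id -> list of student labels."""
    return {
        s: [f"school{s}-student{i}" for i in range(students_per_school)]
        for s in range(num_schools)
    }


def stratified_sample(population, n_per_stratum, rng):
    """Stratified sample: every subdivision contributes some students."""
    return {s: rng.sample(students, n_per_stratum)
            for s, students in population.items()}


def cluster_sample(population, m_clusters, n_per_cluster, rng):
    """Two-stage sample: sample schools first, then students within them."""
    chosen = rng.sample(sorted(population), m_clusters)
    return {s: rng.sample(population[s], n_per_cluster) for s in chosen}


rng = random.Random(0)  # fixed seed so the illustration is repeatable
population = make_population()
strat = stratified_sample(population, n_per_stratum=5, rng=rng)
clust = cluster_sample(population, m_clusters=5, n_per_cluster=20, rng=rng)

# Both samples contain 100 students, but only the stratified sample
# includes students from every school in the population.
print(len(strat), "schools represented in the stratified sample")
print(len(clust), "schools represented in the cluster sample")
```

Both samples have the same total size; they differ only in whether every subdivision of the population is represented.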

Clustered sampling involving clusters at more than one level of the population is also possible, 
and this practice is widely used in survey sampling. For example, one might choose first to 
sample school districts (clusters at one level), then sample schools (subclusters at a lower level), 
and then sample children within schools at the third level. Alternatively, one might sample 
schools (clusters at one level), then sample classrooms (subclusters at a lower level), and then 
sample children within classrooms at the third level. Such sampling designs are called three- 
stage samples or three-level sampling designs. In the case of two-level or three-level sampling 
designs, whether the subdivisions of the population are clusters or strata depends upon whether 
all of the subdivisions in the population are included in the sample. This determination, in turn, 
depends on the definition of the population of interest. 

Population of Interest 

The concept of a population of interest is more slippery in intervention studies than in 
sample surveys. In sample surveys, the population of interest is typically a static population at 
some moment in time, such as high school graduates, or students needing a specific service, such 
as a behavioral intervention. In such cases, it is fairly easy to determine what all the population 
subdivisions might be and whether they are all included in the sample. In intervention studies, 
however, the population may or may not be a static population. For instance, a population of 
students could be identified as requiring intensive intervention, based on a score on a screening 
measure. Once these students had received the intervention, however, it is possible that some of 
them would be designated as no longer needing the intervention because they had reached certain 
benchmarks. Thus, the population in this case is not necessarily static. 
In some cases, the population of interest may be fixed in both time and place. The most 
important example is in studies where the primary question may be, “Did the treatment produce 
an effect on the students studied in the classrooms and schools that were part of the study?” For 
instance, did an intervention work well with displaced survivors of Hurricane Katrina in New 
Orleans who are eligible for counseling services in the elementary grades? In such cases, it is 
reasonable to consider the particular schools and classrooms in the study to define the population 
of interest. Only the sampling of students into schools and classrooms is a source of sampling 
randomness. Inferences are, therefore, restricted to — or conditional on — the particular schools 
and classrooms in the study. The inference model associated with this population definition is 
often called the conditional inference model (Hedges and Vevea 1998). In other cases, the 
population of interest may not be fixed in either time or place. The most obvious examples are 
effectiveness studies in which the object is to determine whether the intervention would produce 
effects in a wider (perhaps national) population of schools and classrooms. For instance, will the 
intervention work with all survivors of natural disasters who are in need of counseling services in 
the elementary grades? In such cases, it is natural to consider the particular schools and 
classrooms in which the intervention takes place as only a sample of schools and classrooms 
from the larger population to which one might generalize. The inference model associated with 
this population definition is often called the unconditional inference model (Hedges and Vevea 
1998). 

There are further subtleties in these population definitions when considering population 
subdivisions, such as classrooms within schools (or if districts are assigned to treatments, schools 
within districts). Suppose that a particular school has only three fourth-grade classrooms. One 
might take the position that all of the fourth-grade children in the school at the present time are in 
one of these classrooms. Consequently, the three classrooms represent three strata of the school 
population. Alternatively, one might take the position that these three classrooms are just the 
three classrooms that happen to be in the school at the present time. Following this train of 
thought, one could conclude that there will be other classrooms in the school in the future and the 
three that happen to be there at the present time are a sample from a population of possible 
classrooms within that school. 

This latter argument may seem strange until one considers the analogous argument about 
students. Although all of the students within the classroom may be sampled, it would seem odd 
to say that the entire population of students who could have been in that classroom has been 
sampled. The more natural population definition seems to require that the students be considered 
a sample from a population of students who could have been sampled into that classroom. 
Similarly, if one imagines that the test scores of students in a particular classroom are influenced 
by the teacher they happen to have, then it is odd to think that the particular classrooms in a 
school at any one time — and the teachers that happen to be assigned to them — constitute a 
population of interest. A more natural population of interest seems to be one where the present 
teachers (and therefore classrooms) are a sample of potential teachers (and therefore classrooms). 

This paper focuses on the unconditional inference model, in which the observed set of population 
subdivisions is considered a sample of the larger population. Thus, schools and classrooms will 
be considered clusters in multistage clustered samples. The statistical analyses associated with 
this inference model would naturally include schools and classrooms as random effects if they 
were considered clusters in the sampling design. 

Implications for Statistical Analysis 

If the details of population definition, such as the distinction between strata and clusters (or the 
conditional versus unconditional inference models), were merely a matter of terminology, it 
would be of little scientific consequence. However, the distinction between clusters and strata is 
of major consequence to the sampling distribution of statistics computed from a sample. In order 
to make this distinction precise, it is necessary to introduce notation for certain population 
quantities and a specific model for quantifying the randomness about these quantities contained 
in any sample. 

Suppose that the population has a nested structure of individuals within schools (clusters), that 
the observations within the schools (clusters) are normally distributed about cluster means with a 
common within-cluster variance $\sigma_W^2$, and that the cluster means themselves are normally 
distributed about the overall population mean $\mu$ with between-cluster variance $\sigma_S^2$. The total 
variance of the observations in the population is, therefore, 

$$\sigma_T^2 = \sigma_S^2 + \sigma_W^2.$$

The amount of clustering by schools in the population is quantified by the intraclass correlation 
coefficient (ICC). The ICC describes the extent to which the students within a cluster (e.g., 
schools) are more alike than those in different clusters (e.g., different schools within the same 
district). In terms of the variance components, the ICC is 

$$\rho = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_W^2} \qquad (1)$$
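
As a small numerical illustration of equation (1), the sketch below computes the ICC as the between-cluster share of the total variance. The function name and the variance values (20 and 80) are hypothetical and chosen only for illustration; they are not reference values.

```python
def intraclass_correlation(var_between, var_within):
    """ICC: between-cluster variance divided by total variance."""
    if var_between < 0 or var_within < 0:
        raise ValueError("variances must be non-negative")
    total = var_between + var_within
    if total == 0:
        raise ValueError("total variance must be positive")
    return var_between / total


# Illustrative values only: 20 units of between-school variance and
# 80 units of within-school variance give an ICC of 20 / 100.
rho = intraclass_correlation(var_between=20.0, var_within=80.0)
print(rho)
```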



Consider a simple random sample of size $mn$. A result from elementary statistics is that the 
variance of the mean of that simple random sample will be 

$$\frac{\sigma_T^2}{mn}.$$

Now suppose that instead of a simple random sample, a sample of the same size $mn$ is obtained 
by first sampling $m$ schools (clusters) and then obtaining a simple random sample of $n$ 
individuals within each school (cluster). The variance of the mean of this clustered sample would 
not be $\sigma_T^2/mn$ but 

$$\left[\frac{\sigma_T^2}{mn}\right]\left[1 + (n - 1)\rho\right].$$

The variance of the mean of the clustered sample is bigger by a factor $[1 + (n - 1)\rho]$, which is 
sometimes called the sample design effect (Kish 1965) or, more descriptively, the variance 
inflation factor. In this paper, however, the term “design effect” will be used to describe a 
somewhat different quantity. We use “design effect” to describe the gain in precision obtained by 
having sampled more than one individual per cluster. 

The fact that sample means from clustered samples have more sampling randomness than those 
from unclustered (simple random) samples has implications for experimental design. Because 
treatment effects are defined as mean differences, when entire clusters are assigned to treatments, 
estimates of treatment effects are less precise than estimates from simple random samples of the 
same size. 
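
The loss of precision described above can be made concrete with a short calculation. The sketch below evaluates the variance of a sample mean under simple random sampling and under two-stage clustered sampling of the same total size. The function names and the numerical values (total variance 1, 20 schools, 25 students per school, ICC of 0.2) are hypothetical. To avoid confusion with this paper's own use of the term "design effect," the code calls the factor the variance inflation factor.

```python
def srs_mean_variance(total_var, m, n):
    """Variance of the mean of a simple random sample of size m * n."""
    return total_var / (m * n)


def variance_inflation(n, rho):
    """Variance inflation factor 1 + (n - 1) * rho for n units per cluster."""
    return 1 + (n - 1) * rho


def clustered_mean_variance(total_var, m, n, rho):
    """Variance of the mean of m clusters with n individuals each."""
    return srs_mean_variance(total_var, m, n) * variance_inflation(n, rho)


# Illustrative values only.
total_var, m, n, rho = 1.0, 20, 25, 0.2
v_srs = srs_mean_variance(total_var, m, n)
v_clustered = clustered_mean_variance(total_var, m, n, rho)
print("simple random sample:", v_srs)
print("clustered sample:    ", v_clustered)
print("ratio:               ", v_clustered / v_srs)
```

With these values the clustered sample mean has nearly six times the variance of the simple random sample mean, even though both samples contain 500 students.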







Chapter 3: Experimental Designs 

Although many experimental designs have been developed (see Kirk 1995), the most widely 
used experimental designs in education research are variants of one of two basic designs: the 
hierarchical (or nested) design and the (generalized) randomized-block design. These designs 
both assume a clustered sampling design, but they differ in how random assignment is made to 
treatments. Because nearly all field studies use a variant of one of these two basic designs, this 
paper will limit its discussion to these two designs. 

The Hierarchical Design 

In the hierarchical design, entire clusters (e.g., schools or classrooms) are assigned to treatments 
(Kirk 1995). Thus, every student in a cluster (a school or classroom) receives the same treatment, 
but students in different clusters may receive different treatments (e.g., grade-level mathematics 
tutoring). This design is sometimes called the cluster-randomized design because entire clusters 
are randomly assigned to treatments. The hierarchical design is perhaps the most widely used 
experimental design in education; it is desirable in situations where contamination among 
treatment groups would be possible if more than one treatment were present in the same cluster 
(e.g., in the same classroom). The hierarchical design minimizes potential problems of 
contamination between treatments because only one treatment is present in the same cluster (e.g., 
in the same classroom). In other words, this type of design helps to alleviate contamination 
because the whole cluster (e.g., the classroom) receives the treatment. The hierarchical design is 
also desirable in situations where it would be practically difficult to assign the treatment to some 
students in a cluster but not to others. Some treatments act at the level of the entire cluster (e.g., 
whole-school interventions such as positive behavior support). In these cases, it would be 
conceptually impossible to assign different treatments to different individuals within a cluster or 
to withhold treatment from one group but grant it to another within a cluster. 

Every hierarchical design involves at least one stage of clustered sampling (such as sampling 
schools first and then students within schools). However, the hierarchical design may also 
involve two or more stages of clustered sampling. For example, if schools are sampled first, then 
classrooms, then students within classrooms, the result is a three-stage sample with two levels of 
clustering. In hierarchical designs, random assignment to treatments occurs only to entire clusters 
at the highest level of clustering. For example, schools would be the unit of random assignment 
in a three-level design involving schools, classes, and students. If individuals within the same 
cluster are assigned to different treatments, the design is no longer considered a hierarchical 
design but instead is referred to as a randomized-block design, as described in the next section. 

The Randomized-Block Design 

In the randomized-block design, individuals within the same cluster are assigned to different 
treatments. For example, if students within the same school are assigned to either of two 
treatments (e.g., an intervention and a control), the design is a generalized randomized-block 
design (Kirk 1995). The randomized-block design is sometimes called a matched design because 
individuals assigned to treatments are matched within clusters or blocks. This design is not to be 
confused with quasi-experimental designs, which are also called matched designs. This paper 
uses the term “matched design” to express that the students are not assigned to treatments as 
individuals but rather as a part of a block or cluster. A matched design is also sometimes called a 
multisite design because an important application is to multisite trials (such as multicenter 
medical trials) where the clusters are sites. This design has the advantage that treatment effects 
are estimated within clusters so that variation among clusters can be estimated separately from 
variation in treatment effects, and the variation among clusters does not increase the variation in 
the estimated average treatment effect. Relating this design to education research, one could say 
that when assignment is made to individuals within schools, school-specific attributes can be 
estimated separately from school-specific treatment effects. 

Every randomized-block design involves at least one stage of clustered sampling (such as 
sampling schools first and then students within schools before they are assigned to treatments 
within schools). Like hierarchical designs, randomized-block designs may also have more than 
one stage of clustering in the sample. For example, schools may be sampled first, then 
classrooms, and then students within classrooms. Given this three-stage sampling plan with two 
levels of clustering, there are two ways to assign treatments within clusters, leading to two 
different experimental designs. One design assigns entire intact classrooms to treatments, so that 
every individual within the same classroom receives the same treatment. The other design 
assigns individual students within classrooms to treatments, so that different students within the 
same classroom may have different treatments. Regardless of whether there are two or three 
levels of clustering, assignment to treatments must occur within the highest level clusters in the 
sampling design. If it does not (that is, if assignment to treatments occurs at the highest level of 
clustering), the design becomes a hierarchical design. 
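
The difference between the two designs lies entirely in the unit of random assignment, which the following sketch illustrates for a two-level sample. The data (four schools of six students), the function names, and the equal split between two arms are hypothetical choices made for illustration.

```python
import random


def assign_hierarchical(clusters, rng):
    """Hierarchical design: whole clusters are assigned to arms."""
    ids = sorted(clusters)
    rng.shuffle(ids)
    half = len(ids) // 2
    arm_of_cluster = {c: ("treatment" if i < half else "control")
                      for i, c in enumerate(ids)}
    # Every member inherits the arm of its cluster.
    return {c: {member: arm_of_cluster[c] for member in members}
            for c, members in clusters.items()}


def assign_randomized_block(clusters, rng):
    """Randomized-block design: members are assigned within each cluster."""
    assignment = {}
    for c, members in clusters.items():
        shuffled = list(members)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        assignment[c] = {member: ("treatment" if i < half else "control")
                         for i, member in enumerate(shuffled)}
    return assignment


rng = random.Random(1)  # fixed seed so the illustration is repeatable
schools = {f"school{s}": [f"s{s}-student{i}" for i in range(6)]
           for s in range(4)}
hier = assign_hierarchical(schools, rng)
block = assign_randomized_block(schools, rng)

for name in sorted(schools):
    print(name,
          "hierarchical arms:", sorted(set(hier[name].values())),
          "| block arms:", sorted(set(block[name].values())))
```

In the hierarchical assignment each school contains a single arm; in the randomized-block assignment every school contains both.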
Chapter 4: Statistical Power



Research studies evaluating education interventions are typically required to demonstrate that 
they are designed sufficiently well to provide sound evidence about the effects of the 
intervention in question. Many factors contribute to a successful intervention design, including 
feasibility of recruiting and retaining the sample, likelihood of successful implementation of the 
intervention, and adequate measurement of the outcome. One fundamental characteristic of a 
research design is whether it will have a good chance to detect the expected effect of the 
intervention or, alternatively, to detect the smallest effect that is deemed to be educationally 
meaningful. Statistical power is the probability that a test of the null hypothesis of no treatment 
effect will successfully reject the null hypothesis when a nonzero average treatment effect exists. 
In simple research designs that use simple random samples, statistical power (e.g., Cohen 1977) 
depends on three things: 

• the significance level of the test; 

• the expected size of the intervention effect (the effect size); and 

• the sample size. 
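
How these three quantities combine can be sketched for the simplest case, a two-group comparison with individual random assignment. The code below is an approximation under stated assumptions: it uses a normal (known-variance) approximation to the two-sided test rather than the exact noncentral t distribution, and the function name and the input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate two-sided power for comparing two group means.

    Normal approximation: the test statistic is treated as normal with
    mean effect_size * sqrt(n_per_group / 2) and unit variance.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)           # critical value of the test
    shift = effect_size * sqrt(n_per_group / 2)
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


# Illustrative values only: a standardized effect of 0.5.
for n in (16, 64, 256):
    print(n, "per group -> power about", round(power_two_group(0.5, n), 2))
```

Raising the significance level, the effect size, or the sample size each increases the computed power; with a zero effect the function returns the significance level itself.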

In multilevel research designs that involve clustering within schools or classrooms, power also 
depends on two other factors that are unique in multilevel designs: 

• how the sample size is distributed over the levels of the design (the sample size at 
each level); and 

• the extent of the clustering effects (typically measured by one or more intraclass 
correlation coefficients [ICCs] in hierarchical designs and by “heterogeneity” 
parameters that quantify the extent to which treatment effects vary across clusters 
in randomized-block designs). 

In designs with clustered samples, many different configurations of sample size at each level can 
lead to the same total sample size, and not all of these configurations lead to the same statistical 
power (Konstantopoulos 2008a; Konstantopoulos 2008b; Raudenbush 1997; Raudenbush and 
Liu 2000; Snijders 2005). For example, in a hierarchical design that assigns schools to 
treatments, one can achieve a total sample size of N = 1,000 by assigning to treatments m = 10 
schools with n = 100 students each or m = 100 schools of n = 10 students each. If there were no 
clustering effects, both choices would lead to identical statistical power. If there is clustering, 
these two choices of sample size allocation lead to very different statistical power. 
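
The two allocations in this example can be compared with a rough calculation. The sketch below is an approximation under several assumptions that are not part of the example itself: schools are split equally between two arms, the standardized effect size is 0.25, the ICC is 0.2, and a normal approximation is used in place of the t distribution (which overstates power when the number of schools is small). All names and values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_cluster_randomized(effect_size, total_clusters, n_per_cluster,
                             rho, alpha=0.05):
    """Approximate two-sided power when whole clusters are randomized.

    Assumes equal allocation of clusters to two arms and total variance 1,
    so the variance of the mean difference is 4 * inflation / N.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    inflation = 1 + (n_per_cluster - 1) * rho   # variance inflation factor
    total_n = total_clusters * n_per_cluster
    shift = effect_size * sqrt(total_n / (4 * inflation))
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


delta = 0.25  # assumed effect size, for illustration only
p_few_rho0 = power_cluster_randomized(delta, 10, 100, rho=0.0)
p_many_rho0 = power_cluster_randomized(delta, 100, 10, rho=0.0)
p_few = power_cluster_randomized(delta, 10, 100, rho=0.2)
p_many = power_cluster_randomized(delta, 100, 10, rho=0.2)

print("no clustering:  10 x 100 ->", round(p_few_rho0, 2),
      "| 100 x 10 ->", round(p_many_rho0, 2))
print("ICC of 0.2:     10 x 100 ->", round(p_few, 2),
      "| 100 x 10 ->", round(p_many, 2))
```

Under these assumptions the two allocations give the same power when the ICC is zero, while with clustering the allocation with many small schools is far more powerful than the allocation with few large schools.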

In designs with clustered samples, the extent of clustering may vary, depending on the 
population studied, age level, subject matter, and outcome measured. The degree of clustering is 
usually measured by comparing the variation among clusters to the total variation via an ICC 
such as that described earlier. Such ICCs measure (on a range from zero to one) the extent to 
which information within clusters is redundant. If the ICC is near zero, there is little clustering 
and there is little redundant information within clusters. In this case, the statistical power will be 
close to that of a design that used simple random sampling and the same total sample size. When 
the ICC is near one, information within clusters is highly redundant, and thus, most of the 
variation is between cluster means. In this case, the statistical power will be close to that of a 
design that used simple random sampling and a total sample size equal to the number of clusters. 
In other words, if the ICC is close to zero, the clusters are quite similar. When the ICC is close to 
one, the clusters are quite dissimilar, and thus, the variation between cluster means is much 
higher than the variation within the clusters. 
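
One way to see these two limiting cases is through an effective sample size: the size of a simple random sample that would give a mean with the same variance as the clustered sample. The sketch below divides the total sample size by the variance inflation factor $1 + (n - 1)\rho$. The function name and the sample sizes are hypothetical.

```python
def effective_sample_size(m, n, rho):
    """Total sample size m * n deflated by the factor 1 + (n - 1) * rho."""
    return m * n / (1 + (n - 1) * rho)


# Illustrative values only: 10 clusters of 100 individuals.
for rho in (0.0, 0.05, 0.2, 1.0):
    print("ICC", rho, "-> effective sample size",
          round(effective_sample_size(10, 100, rho), 1))
```

When the ICC is zero the effective sample size equals the total sample size of 1,000, and when the ICC is one it equals the number of clusters, 10.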

Use of Covariates to Increase Statistical Power 

One additional factor that can profoundly influence statistical power — in both simple designs and 
cluster-randomized designs — is whether covariates are used in the design to increase precision. 

In multilevel designs, as in single-level designs, the use of covariates can dramatically increase 
statistical power (Bloom, Richburg-Hayes, and Black 2007; Hedges and Hedberg 2007; 
Raudenbush, Martinez, and Spybrook 2007). Covariates increase power in multilevel designs by 
decreasing variation among clusters and within clusters. Reducing variation has the same effect on 
power as increasing the effect size. Covariates that decrease variation among clusters are particularly 
useful in multilevel designs because they decrease the effect of clustering by decreasing the ICC. 
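
The mechanism can be sketched numerically. The code below assumes, for illustration, that covariates explain a proportion `r2_between` of the between-cluster variance and a proportion `r2_within` of the within-cluster variance, with the unadjusted total variance set to 1. It ignores the degrees of freedom used in estimating covariate effects, so it is an approximation rather than the exact computation. The function names and the numerical values are hypothetical.

```python
def adjusted_icc(rho, r2_between, r2_within):
    """ICC of the covariate-adjusted outcome (total unadjusted variance 1)."""
    between = rho * (1 - r2_between)
    within = (1 - rho) * (1 - r2_within)
    return between / (between + within)


def cluster_mean_variance(n, rho, r2_between=0.0, r2_within=0.0):
    """Variance of one (adjusted) cluster mean with n individuals."""
    return rho * (1 - r2_between) + (1 - rho) * (1 - r2_within) / n


# Illustrative values only: ICC of 0.2, 20 students per cluster, and a
# cluster-level covariate explaining 60% of the between-cluster variance.
rho, n = 0.2, 20
print("adjusted ICC:", round(adjusted_icc(rho, 0.6, 0.0), 3))
print("cluster-mean variance, no covariate:  ",
      round(cluster_mean_variance(n, rho), 3))
print("cluster-mean variance, with covariate:",
      round(cluster_mean_variance(n, rho, r2_between=0.6), 3))
```

In this illustration the cluster-level covariate lowers the ICC from 0.2 to about 0.09 and halves the variance of each cluster mean, which is what drives the gain in power.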

How Much Statistical Power Is Desirable? 

Because statistical power is the probability of making the correct decision when a treatment 
effect actually exists, high statistical power is desirable. However, in any given design, higher 
power is achieved only with larger sample sizes. Obtaining larger sample sizes typically requires 
the commitment of more resources (not only costs but also research staff and burden on schools). 
Therefore, the benefits associated with higher power must be weighed against the commitment of 
resources required to achieve these benefits. Technical means alone cannot resolve this 
cost/benefit judgment. Normative statistical practice seems to be that power of 0.8 or above is 
considered acceptable, but there is no reason to think that this figure is always appropriate. 

Chapter 5: Computing Statistical Power in Complex Designs 



Specialized software is available for computing statistical power in group-randomized designs. 
One prominent example is Optimal Design, created by Stephen Raudenbush and his colleagues 
at the University of Michigan (Raudenbush et al. 2006). Although the Optimal Design software 
is useful, it does not cover all designs that are of interest to education researchers. For example, it 
does not cover designs using covariates at the individual level. Moreover, the use of software 
such as Optimal Design to compute power often fails to build intuition about how changes in 
parameters translate into changes in power. Finally, using Optimal Design properly requires a 
sophisticated understanding of the meaning of the input parameters. We hope that this paper will 
help education researchers build the intuition necessary to understand the conceptual meaning of 
parameters used in power analysis and how changes in design parameters translate into changes 
in power, whether they use Optimal Design or the methods described here. 

Although specialized software is available, it is not necessary to obtain software in order to 
compute power for multilevel designs. Statistical power in cluster-randomized designs depends 
on sample sizes, intraclass correlation coefficients (ICCs), and covariate effects only through a 
design effect (similar in spirit — but not identical to — the variance inflation factor encountered in 
connection with two-stage clustered sampling). Each design has its own design effect. Except for 
the influence of the design effect, computation of statistical power in complex designs is much 
like computing power in simple designs that do not involve clustering. 

Many tabulations (e.g., Cohen 1977) and computer programs (e.g., Borenstein, Rothstein, and 
Cohen 2001) are available for computing statistical power from designs involving simple random 
samples. The tables for computing power from the independent groups t test are perhaps the most 
widely available. Following Cohen’s framework, such tables typically provide power values 
based on sample sizes n for each treatment group (assumed to be equal) and effect size $\delta$ 
(sometimes called Cohen’s d): 

$$\delta = \frac{\mu_1 - \mu_2}{\sigma} \qquad (2)$$

where $\mu_1$ and $\mu_2$ are the population means in the treatment and control groups, respectively, and $\sigma$ 
is the within-group population standard deviation of the outcome. Table 1 is a slight variation on 
tables of this type. In Table 1, the statistical power of a two-sided test of the null hypothesis of 
no treatment effect at the $\alpha = 0.05$ level of significance is tabulated as a function of the total 
sample size N and $\delta$. The only difference between Table 1 and the usual power tables, such as 
Table 2.3.2 on page 30 in Cohen (1977), is that N refers to the total sample size, rather than the 
sample size within each of two treatment groups. Thus, the row in Table 1 corresponding to any 
even value of N is equivalent to the row in Table 2.3.2 in Cohen with n = N/2. This slight 
modification is made for ease of use with certain of the operational sample sizes and operational 
effect sizes described below. 

Computing Operational Sample and Effect Sizes 

Tables 1 and 2 and tables like Cohen’s (or the corresponding software) can be used to compute 
the power of the test used in the case of complex designs involving clustered sampling by 
appropriately adjusting the sample size and effect size used in the table. To use these tables to 
compute power for clustered designs, one must enter the table with a total sample size $N^T$ based 
on the number of clusters (here called the operational sample size) and a synthetic effect size $\Delta^T$ 
(here called the operational effect size) that will yield the appropriate power. Here the superscript 
T indicates that these quantities are used in the power tables. The design effect is different for 
each design, but the operational effect size will always be the product of the effect size (Cohen’s 
d) and the design effect for that design: 

operational effect size = (effect size) × (design effect) (3) 

or 

$$\Delta^T = \delta \times (\text{design effect}).$$

Note that the term “design effect,” as used here and in the rest of this paper, is not the same as 
the variance inflation factor encountered in connection with multistage cluster samples. This 
definition also differs from other definitions of design effects. The design effect in this context 
reflects the gain in precision obtained by having sampled more than one individual per cluster. 
Specifically, it is the square root of the ratio of the precision of the treatment effect estimate with 
one individual per cluster to the precision of the treatment effect estimate with n individuals per 
cluster. We note that the increase in precision that results from adding individuals per cluster will 
depend on the relevant ICCs or (in randomized-block designs) the extent to which treatment 
effects vary across clusters. A summary of design effects for hierarchical designs is given in 
appendix A and for randomized-block designs in appendix B. 

Statistical power for multilevel designs can be computed without the use of tables, using 
functions in widely available statistical software. Using these functions to compute statistical 
power also involves using degrees of freedom based on the number of clusters and a simple 
function of the operational effect size (called the noncentrality parameter in statistics). The 
noncentrality parameter is different for each design, but it will always be the product of the 
operational effect size and a quantity that is a function of the sample size: 

noncentrality parameter = (operational effect size) × (sample size) 

or 

$$\lambda = \Delta^T \times (\text{sample size}),$$

where (sample size) here is understood to mean some function of sample size specific to the 
design. The use of software functions to compute power makes it possible to avoid interpolation 
in tables and to automate the computation of a large number of power values (e.g., to examine 
many different design possibilities). 

The sections that follow show how to use information about effect size, sample sizes at each 
level, ICC, and covariates to compute the statistical power of designs based on assigning schools, 
classrooms, or individuals to treatments. Each section indicates what the operational sample size 
is and how to compute the operational effect size for that design from the sample sizes and ICC 
and covariate information. How each factor influences statistical power and, therefore, how those 
factors might be manipulated to obtain a research design with adequate statistical power are 
shown. References are provided that may help in choosing plausible values of ICCs and 
correlations between covariates and outcomes for use in power calculations. 

In each case, it is assumed that the experiment is planned to have a balanced design with equal 
numbers of individuals within each cluster. It is also assumed that the design is planned to have 
adequate power to compare two treatments (e.g., a treatment intervention and a control 
condition) because this is standard procedure even in designs that involve more than one active 
treatment. Finally, the term “school” will be used interchangeably with the term “cluster,” and 
the term “class” or “classroom” will be taken to mean “subclusters” (in three-level designs) 
because these are the most likely terms to be used in education research and because the nesting 
relationship is readily understandable (it is common knowledge in this field that classrooms are 
nested within schools). The designations above are purely a matter of convenience: Nothing in 
this paper requires clusters to be schools or subclusters to be classrooms. 

Chapter 6: Computing Power in Two-Level Hierarchical Designs 
That Assign Treatments to Clusters 



Consider a two-level hierarchical experiment that will use a total of 2m clusters (typically 
schools) and will assign m of these schools to one treatment condition and m of these schools to 
another treatment condition (such as a control condition). Suppose that the sample size within 
each cluster has the same value n, so that mn individuals are assigned to each treatment and the 
total sample size is N = 2mn. 

Suppose that the intervention effect at the population level is $(\mu_1 - \mu_2)$ in the units in which the 
outcome is measured (e.g., test score scale points). To compute statistical power, it is necessary 
to know the standardized intervention effect defined in equation (2), also known as the effect 
size. Suppose that the intraclass correlation coefficient (ICC) is $\rho$. Note that in discussions of 
two-level hierarchical designs, the symbol $\rho$, with no subscript, is used to denote ICC without 
ambiguity because there is only one possible intraclass correlation. In the discussion of three-level 
hierarchical designs, a subscript “S” or “C” is added to $\rho$ to indicate the level (school or 
classroom) of the ICC. 



Computing Power for Two-Level Hierarchical Designs With No Covariates 

If the actual number of clusters assigned to each treatment is m, then the power table (Table 1) is 
entered with operational total sample size $N^T = 2m$. The operational effect size is 

$$\Delta^T = \delta \sqrt{\frac{n}{1 + (n-1)\rho}} \qquad (4)$$

where $\delta$ is the effect size, $\rho$ is the ICC, and n is the sample size in each cluster. Note that for 
n > 1, unless $\rho = 1$, the design effect 

$$\sqrt{\frac{n}{1 + (n-1)\rho}}$$

is always larger than one, so the operational effect size $\Delta^T$ is always larger than the actual effect 
size $\delta$. However, the operational sample size $N^T = 2m$ is smaller than the actual sample size 2mn, 
so the power in the design with clustering will not be larger than in the design without clustering. 
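
The two quantities needed to enter the power table can be computed in a few lines. The following is a minimal sketch, assuming the operational effect size is the effect size multiplied by the square root of n / [1 + (n - 1)ICC] and the operational sample size is twice the number of clusters per treatment. The function name and the planning values are hypothetical.

```python
import math


def two_level_table_inputs(delta, m, n, rho):
    """Return (operational sample size, operational effect size) for a
    two-level cluster-randomized design with m clusters per treatment."""
    if n < 1 or m < 2:
        raise ValueError("need n >= 1 and m >= 2")
    design_effect = math.sqrt(n / (1.0 + (n - 1) * rho))
    return 2 * m, delta * design_effect


# Hypothetical planning values chosen only to exercise the function.
n_t, delta_t = two_level_table_inputs(delta=0.25, m=20, n=25, rho=0.15)
print(n_t, round(delta_t, 3))
```

The returned pair corresponds to the row and column used to look up power in a table built for the two-group t test.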

Using the operational effect size makes it possible to compute statistical power and sample size 
requirements for analyses based on clustered samples, using these tables or computer programs 
designed for the two-group t test. For example, entering Table 1 on the row given by the 
operational sample size $N^T$, and finding the column corresponding to the operational effect size 
$\Delta^T$, one can read the power value. 

Using a Computer Program to Compute Statistical Power in Two-Level Hierarchical Designs 
With No Covariates 

An alternative method for computing statistical power in clustered designs is to use a computer 
program that has a built-in function that computes the noncentral t-distribution (Johnson and 
Kotz 1971). To use such a function to compute the statistical power of a test, one must provide 
the program with a value of an index $\lambda$ (related to the operational effect size) called the 
noncentrality parameter, specifically, 

$$\lambda = \delta \sqrt{\frac{mn}{2[1 + (n-1)\rho]}} \qquad (5)$$

If $H(x, \nu, \lambda)$ is the cumulative distribution function of the noncentral t-distribution with $\nu$ degrees 
of freedom and noncentrality parameter $\lambda$, then the power of the one-tailed test for treatment 
effects at level $\alpha$ is 

$$p_1 = 1 - H[c(\alpha, 2m-2), (2m-2), \lambda], \qquad (6)$$

where $c(\alpha, \nu)$ is the level $\alpha$ one-tailed critical value of the t-distribution with $\nu$ degrees of 
freedom [e.g., c(0.05, 10) = 1.81]. The power of the two-tailed test at level $\alpha$ is 

$$p_2 = 1 - H[c(\alpha/2, 2m-2), (2m-2), \lambda] + H[-c(\alpha/2, 2m-2), (2m-2), \lambda]. \qquad (7)$$
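
The calculation in equations (5) through (7) can be sketched in code. Statistical packages usually supply the noncentral t-distribution directly (SciPy's `scipy.stats.nct` is one example). To stay self-contained, the sketch below instead approximates H by numerically integrating the normal distribution function over a chi-square density, and finds critical values by bisection. This swapped-in technique, all function names, and the example design are assumptions for illustration; the integration is intended only for degrees of freedom comfortably above 2.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF


def chi2_pdf(v, df):
    """Chi-square density; returns 0 at v <= 0, which is adequate for df > 2."""
    if v <= 0.0:
        return 0.0
    half = df / 2.0
    return math.exp((half - 1.0) * math.log(v) - v / 2.0
                    - half * math.log(2.0) - math.lgamma(half))


def nct_cdf(x, df, nc, steps=2000):
    """H(x, df, nc): noncentral t CDF via Simpson's rule over the chi-square scale."""
    upper = df + 12.0 * math.sqrt(2.0 * df)  # mass beyond this point is negligible
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        v = i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * _PHI(x * math.sqrt(v / df) - nc) * chi2_pdf(v, df)
    return total * h / 3.0


def t_critical(tail_prob, df):
    """c(tail_prob, df): upper-tail critical value of the central t, by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if 1.0 - nct_cdf(mid, df, 0.0) > tail_prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def power_two_level(delta, m, n, rho, alpha=0.05, two_tailed=True):
    """Power from equations (5)-(7) for m clusters per treatment, n per cluster."""
    df = 2 * m - 2
    lam = delta * math.sqrt(m * n / (2.0 * (1.0 + (n - 1) * rho)))
    if two_tailed:
        c = t_critical(alpha / 2.0, df)
        return 1.0 - nct_cdf(c, df, lam) + nct_cdf(-c, df, lam)
    return 1.0 - nct_cdf(t_critical(alpha, df), df, lam)


# Hypothetical design used only to exercise the functions.
print(round(power_two_level(delta=0.40, m=25, n=12, rho=0.15), 2))
```

Because the function takes the design parameters directly, it can be called in a loop over candidate values of m and n, which avoids interpolating in tables.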

The Impact of Design Parameters on Power for Two-Level Hierarchical 
Designs With No Covariates 

This formulation helps make clear the effects of within-cluster sample size n, number of clusters 
per treatment m, ICC $\rho$, and effect size $\delta$ on statistical power. As Table 1 makes clear, for any 
fixed operational effect size, power increases rapidly as m increases, tending to 1.00 as m 
becomes large. Similarly, for any fixed operational sample size, power increases rapidly as $\delta$ 
increases, tending to 1.00 as $\delta$ becomes large. These are basic facts that also apply to a power 
analysis of simple designs. 

The within-cluster sample size n and the ICC $\rho$ affect power by altering the design effect. The 
impact of $\rho$ on the design effect is the easiest to see. If $\rho = 0$, the design effect is $\sqrt{n}$, which is the 
maximum value of the design effect and, therefore, corresponds to the maximum operational 
effect size and the maximum power that can be attained in this design (where the values of n and 
$\delta$ are considered fixed). However, as $\rho$ increases, the design effect, and therefore power, 
decrease. For example, when n = 20, the design effect is 4.47 when $\rho = 0.0$, but only 2.63 when 
$\rho = 0.1$; 2.04 when $\rho = 0.2$; 1.73 when $\rho = 0.3$; and, of course, the design effect is 1.0 when $\rho = 1.0$. 

To see the impact of the within-cluster sample size n on the design effect, it is useful to rewrite 
the design effect as 

$$\sqrt{\frac{n}{1 + (n-1)\rho}} = \sqrt{\frac{1}{\frac{1}{n} + \left(1 - \frac{1}{n}\right)\rho}}.$$

This formulation makes clear that as n increases, the denominator of the design effect becomes 
smaller, so the design effect (and therefore the operational effect size) becomes larger, but only 
to a point. No matter how large n becomes, the design effect can never become larger 
than $\sqrt{1/\rho}$. Moreover, the design effect approaches this limiting value rather quickly. For 
example, if $\rho = 0.20$, the largest the design effect can be is $\sqrt{1/0.20} = 2.24$, but when n = 10, the 
design effect is already 1.89, and doubling n to n = 20 increases the design effect to only 2.04. 
Any further increases in n can have only very modest impacts on the design effect and therefore 
on power, demonstrating that, beyond a point (which occurs when n is rather modest), obtaining 
a larger sample size by increasing n has little effect on power. 

Example. Consider a design for a study to evaluate the effects of a second-grade supplemental 
reading intervention in which whole schools have been assigned to either receive the treatment 
(intervention) or not. The effectiveness of the intervention will be measured by a post-treatment 
standardized reading test. In this intervention, recruitment comes from a broad range of schools, 
and there is some evidence (e.g., Hedges and Hedberg 2007) that a school-level ICC of about 
$\rho = 0.20$ is plausible. Note that this assumption is intended for illustration only and is not intended to 
suggest a value that will always be appropriate for other research studies. The study expects that 
at least 10 students will participate from each school, and previous studies of this intervention 
suggest that the effect size is likely to be $\delta = 0.35$. The initial plan was for a study that would 
assign m = 30 schools to each condition. To determine the statistical power of this design, the 
operational effect size is first computed, using equation (4), as 

$$\Delta^T = (0.35)\sqrt{\frac{10}{1 + (10-1)(0.20)}} = (0.35)(1.89) = 0.661.$$

Entering Table 1 on the row corresponding to $N^T = 2m = 60$ shows that the power for $\Delta^T = 0.66$ 
will be between 0.63 (the power for $\Delta^T = 0.60$) and 0.76 (the power for $\Delta^T = 0.70$). Interpolating 
between these two values (0.66 is six-tenths of the way from 0.6 to 0.7, so six-tenths of the way 
between 0.63 and 0.76) gives a power level of 0.71. 
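
The interpolation step is simple arithmetic, and it can be sketched as a small helper. The function name is hypothetical; the inputs are the tabled values quoted in this example.

```python
def interpolate_power(effect, lo_effect, lo_power, hi_effect, hi_power):
    """Linearly interpolate a tabled power value between two adjacent columns."""
    if not lo_effect <= effect <= hi_effect:
        raise ValueError("effect must lie between the two tabled effect sizes")
    fraction = (effect - lo_effect) / (hi_effect - lo_effect)
    return lo_power + fraction * (hi_power - lo_power)


# Values quoted in the example: row for 60 clusters, columns 0.60 and 0.70.
power = interpolate_power(0.66, 0.60, 0.63, 0.70, 0.76)
print(round(power, 2))
```

Linear interpolation is only an approximation, because power is not a linear function of the effect size, but over a column width of 0.10 the error is small.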

Because a power level of 0.71 is lower than desired, one might consider altering the design to 
increase power. For example, more students could be recruited from each school. Doing so 
would improve power only slightly. For example, if the number of students per school were 
increased by 50% to n = 15 (but the number of schools were kept constant at m = 30 per treatment 
group), the operational effect size would increase to only $\Delta^T = 0.695$, improving the power to 
only 0.75. If the number of students per school were doubled to n = 20 (but the number of 
schools were held constant at m = 30 per treatment group), the operational effect size would 
increase to only $\Delta^T = 0.714$, improving the power to only 0.77. 

Increasing the number of schools has a much more dramatic effect on power. Increasing the 
number of schools by 50% to m = 45 per treatment group would have no effect on the 
operational effect size, but power values from the row of the table for $N^T = 2m = 90$ for $\Delta^T = 0.60$ 
and 0.70 show that the power is between 0.80 and 0.91. Interpolating six-tenths of the way from 
0.80 to 0.91 yields a power for $\Delta^T = 0.66$ of 0.87. An alternative way to increase power would be 
the use of covariates, which is discussed in the next section. 
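
The diminishing return from adding students per school in this example can be tabulated directly. The sketch below assumes the operational effect size is 0.35 times the square root of n / [1 + (n - 1)(0.20)] and compares it with its ceiling as n grows without bound. The function name and the values n = 40 and n = 80 are added for illustration and do not appear in the example.

```python
import math


def operational_effect_size(delta, n, rho):
    """Effect size multiplied by the two-level design effect."""
    return delta * math.sqrt(n / (1.0 + (n - 1) * rho))


delta, rho = 0.35, 0.20                  # values from the reading example
ceiling = delta * math.sqrt(1.0 / rho)   # limit as n grows without bound

for n in (10, 15, 20, 40, 80):
    value = operational_effect_size(delta, n, rho)
    # Columns: n, operational effect size, share of the ceiling attained.
    print(n, round(value, 3), round(value / ceiling, 3))
```

At n = 10 the operational effect size is already within about 15% of its ceiling, which is why adding schools does far more for power than adding students.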

Two-Level Hierarchical Designs With Covariates 

Now suppose that the analysis has $q_S$ ($0 \le q_S < M - 2$) cluster-level covariates and $q_W$ 
($0 \le q_W < N - q_S - 2$) individual-level covariates.¹ For example, a design with $q_W = 1$ and $q_S = 1$ might arise if 
a pretest score (centered on the mean cluster score on the pretest) were used as an individual-level 
covariate and cluster means on the covariate were used as a group-level covariate. Note that 
individual-level covariates must be centered on cluster means for the power analyses described 
below to be exactly correct. 

When covariates are used in the design, both the operational effect size and the operational 
sample size need to be slightly modified. Table 1 is now entered with operational sample 
size $N_A^T = 2m - q_S$. This decrease in the operational sample size relative to a design without 
covariates reflects the degrees of freedom lost due to the modeling of between-cluster covariates. 

The operational effect size is increased to an extent that depends on how much the covariates 
explain between- and within-cluster variance and how many cluster-level covariates are used. 
$R_W^2$ is the proportion of within-cluster variance the covariates explain and $R_S^2$ is the proportion of 
between-cluster variance the covariates explain. Thus, $R_W^2$ and $R_S^2$ can be thought of as 
proportions of variance (squared multiple correlations between the set of covariates and the 
response) accounted for in the usual way. The covariate-adjusted operational effect size is 

$$\Delta_A^T = \delta \sqrt{\frac{2m}{2m - q_S}} \sqrt{\frac{n}{1 + (n-1)\rho - [R_W^2 + (nR_S^2 - R_W^2)\rho]}} \qquad (8)$$



Note that the covariate-adjusted design effect implied by equation (8) consists of two distinct 
parts. The first part, $\sqrt{2m/(2m - q_S)}$, is a correction term that depends on the sample size, taking 
into consideration the number of clusters (2m) and the number of cluster-level covariates $q_S$. This 
term is necessary because the number of degrees of freedom used in the t test depends on the 
number of cluster-level covariates modeled, but the noncentrality parameter $\lambda$ does not depend 
on the number of covariates modeled. Note that the value of this factor is usually quite close to 
one. For instance, an experiment with m = 20 clusters assigned to each treatment group, using 
$q_S = 1$ cluster-level covariate, produces $\sqrt{2m/(2m - q_S)} = 1.013$, which differs from one by only 
about 1%. 

The second part of the design effect, 

$$\sqrt{\frac{n}{1 + (n-1)\rho - [R_W^2 + (nR_S^2 - R_W^2)\rho]}},$$

is of a similar form to the unadjusted design effect. Because $\sqrt{2m/(2m - q_S)}$ is so often virtually 
one, the design effect will typically be quite close to the last factor in equation (8). 

¹ Note that the possibility of having 0 (no) covariates at a given level has been included in the previous section, and 
that adding covariates necessitates modifications of the operational sample size and the operational effect sizes. 

To compute power, enter Table 1 on the row given by the operational sample size 
$N_A^T = 2m - q_S$ and find the column corresponding to the operational effect size $\Delta_A^T$. The power 
value can then be read from the table. 



To use the noncentral t-distribution function to compute the statistical power of a test, one must 
provide the program with a value of the covariate-adjusted noncentrality parameter, 

$$\lambda_A = \delta \sqrt{\frac{mn}{2}} \sqrt{\frac{1}{1 + (n-1)\rho - [R_W^2 + (nR_S^2 - R_W^2)\rho]}} \qquad (9)$$

Power is computed using equation (6) for a one-tailed test or (7) for a two-tailed test, as above, 
except that the covariate-adjusted noncentrality parameter (9) is used with $2m - 2 - q_S$ degrees of 
freedom rather than 2m - 2 degrees of freedom. 

The Impact of Design Parameters on Power for Two-Level Hierarchical Designs With 
Covariates 

As in the case without covariates, for any fixed operational effect size, power 
increases rapidly as m increases, tending to 1.00 as m becomes large. Similarly, for any fixed 
operational sample size, power increases rapidly as $\delta$ increases, tending to 1.00 as $\delta$ becomes 
large. 

Moreover, the impact of the within-cluster sample size n, the ICC $\rho$, and the covariate-outcome 
(multiple) correlations ($R_W$ within and $R_S$ among clusters) occurs entirely through the second 
factor in the design effect. The effects of $\rho$ and n on the design effect are similar to those in the 
design with no covariates. As $\rho$ increases, the denominator of the design effect increases, the 
design effect decreases, and therefore power decreases. As n increases, the design effect (and 
therefore the operational effect size) becomes larger, but n has only a very modest impact on the 
design effect (and therefore on power) beyond a certain point. Thus, as in the design without 
covariates, beyond a point (which occurs when n is rather modest), obtaining a larger sample size 
by increasing n has little impact on power. However, as n becomes large in the design with 
covariates, the maximum design effect is now 

$$\sqrt{\frac{1}{(1 - R_S^2)\rho}}$$

instead of $\sqrt{1/\rho}$, as it was in the design without covariates. 

Note that the denominator of the design effect has two terms. The first term, 

$$1 + (n-1)\rho,$$

is the denominator of the design effect in the design without covariates. The second term, 

$$-[R_W^2 + (nR_S^2 - R_W^2)\rho],$$

has the same form as $[1 + (n-1)\rho]$, except that $R_W^2$ replaces 1 and $nR_S^2$ replaces n. Because $R_S^2$ 
and $R_W^2$ are typically larger than zero (and cannot be smaller than zero), this second term is 
negative and increases the design effect. If $R_S^2 = R_W^2 = 0$, the covariates have no effect and the 
design effect becomes the same as in the design with no covariates. 

The presence of $q_S$ in the denominator of equation (8) at first glance suggests that power can be 
increased simply by using more cluster-level covariates. This advantage is illusory because, 
although a larger value of $q_S$ will make $\Delta_A^T$ larger, it will also make the operational sample size, 
$N_A^T$, smaller. In fact, unless the addition of more cluster-level covariates increases the value of 
$R_S^2$, this addition can only harm power, not help it. 

Example. Return to the design considered earlier for a study to evaluate the effects of a second-grade 
supplemental reading intervention where a school-level ICC of about $\rho = 0.20$ is plausible, 
n = 10 students would participate from each school, and previous studies of this intervention 
suggest that the effect size is likely to be $\delta = 0.35$. Suppose that pretreatment reading test scores 
are available and that both the individual pretest scores and the school means for this pretest will 
be used as covariates in the analysis. Thus, $q_S = 1$ and $q_W = 1$. There is some evidence that values 
of $R_W^2 = 0.5$ and $R_S^2 = 0.8$ are plausible (see Table 3 in Hedges and Hedberg 2007), although we 
caution, as we did in the previous example, that these values are mainly intended for illustration 
and should not necessarily be interpreted as reference values to use when planning other research 
studies. The initial plan was for a study that would assign m = 20 schools to each condition. To 
determine the statistical power of this design, the operational effect size is computed using 
equation (8) as 

$$\Delta_A^T = (0.35)\sqrt{\frac{40}{40 - 1}}\sqrt{\frac{10}{1 + (10-1)(0.2) - [0.5 + (10 \times 0.8 - 0.5)(0.2)]}} = (0.35)(1.013)(3.536) = 1.253.$$



Entering Table 1 on the row corresponding to $N_A^T = 2m - q_S = 39$ yields a power for $\Delta_A^T = 1.25$ 
that will be between 0.95 (the power for $\Delta^T = 1.20$) and 0.98 (the power for $\Delta^T = 1.30$). 
Interpolating between these two values (1.25 is halfway from 1.2 to 1.3) requires going half of 
the way between 0.95 and 0.98, which yields a power level of 0.965, or 0.96 (to round lower and 
be slightly conservative). 

This high statistical power might be seen as providing a margin of safety in case any assumptions 
are somewhat optimistic. Alternatively, because a power level of 0.96 may be higher than 
necessary, one might consider altering the design to decrease costs while maintaining acceptable 
statistical power. For example, one might consider decreasing the number of schools to m = 15 
per treatment group. That would have very little effect on the operational effect size (it would 
now be 1.259), but reading power values from the row of the table for $N_A^T = 2m - q_S = 29$ for 
$\Delta^T = 1.20$ and 1.30 shows that the power is between 0.88 and 0.92. Interpolating six-tenths of the 
way from 0.88 to 0.92 gives a power for $\Delta_A^T = 1.26$ of 0.90. Comparing this value with that 
derived in the last section for the same study without covariates, one sees that a design with 
m = 15 schools using a pretest as a covariate has greater power than a design with m = 45 schools 
(three times as many schools) using no covariates (assuming the $R^2$ values used are accurate). 
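
The covariate-adjusted quantities in this example can be reproduced with a short sketch. It assumes that the adjusted operational effect size is the effect size times $\sqrt{2m/(2m - q_S)}$ times the square root of n divided by $1 + (n-1)\rho - [R_W^2 + (nR_S^2 - R_W^2)\rho]$, and that a table entry with operational sample size $N^T$ and operational effect size $\Delta^T$ corresponds to a noncentrality parameter of $\Delta^T\sqrt{N^T}/2$. Function and argument names are hypothetical.

```python
import math


def adjusted_table_inputs(delta, m, n, rho, r2_within, r2_between, q_s):
    """Return (operational sample size, operational effect size) for a
    two-level design with q_s cluster-level covariates."""
    denom = 1.0 + (n - 1) * rho - (r2_within + (n * r2_between - r2_within) * rho)
    if denom <= 0:
        raise ValueError("inputs imply a non-positive residual variance term")
    df_factor = math.sqrt(2 * m / (2 * m - q_s))
    return 2 * m - q_s, delta * df_factor * math.sqrt(n / denom)


def noncentrality(n_t, delta_t):
    """Noncentrality parameter implied by a table entry."""
    return delta_t * math.sqrt(n_t) / 2.0


# Reading example: 15 schools per group with pretest covariates at both
# levels, versus 45 schools per group with no covariates.
with_cov = adjusted_table_inputs(0.35, 15, 10, 0.20, 0.5, 0.8, q_s=1)
no_cov = adjusted_table_inputs(0.35, 45, 10, 0.20, 0.0, 0.0, q_s=0)
print(with_cov, round(noncentrality(*with_cov), 2))
print(no_cov, round(noncentrality(*no_cov), 2))
```

The smaller design with covariates has the larger noncentrality parameter. The two designs have different degrees of freedom, so this comparison is only a rough guide and the power values themselves should still be taken from the table or from a noncentral t function.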

Chapter 7: Computing Statistical Power in Three-Level 
Hierarchical Designs 

Consider a three-level hierarchical experiment that will use a total of 2m clusters (typically 
schools) and will assign m of these schools to a treatment condition and m of these schools to a 
control condition. Suppose that each school has a total of p subclusters (usually classrooms) and 
that the sample size within each classroom has the same value n, so that the total sample size is 
N = 2mpn. 

Suppose that the intervention effect at the population level is $(\mu_1 - \mu_2)$ in the units in which the 
outcome is measured (e.g., test score scale points). In three-level designs, statistical power also 
depends on the intervention effect via the effect size or standardized intervention effect 
(sometimes called Cohen’s d): 

$$\delta = \frac{\mu_1 - \mu_2}{\sigma_T}$$

where $\sigma_T$ is the total population standard deviation of the outcome. 



In three-level models, two indices are necessary to characterize the relationship between the 
component variances that make up the total variance, and they are generalizations of the 
intraclass correlation coefficient (ICC). Let $\sigma_T^2 = \sigma_S^2 + \sigma_C^2 + \sigma_W^2$ be the total variance, where $\sigma_S^2$ 
is the between-cluster (e.g., between-school) variance, $\sigma_C^2$ is the between-subcluster but within-cluster 
(e.g., between-classroom within-school) variance, and $\sigma_W^2$ is the within-subcluster (e.g., 
within-classroom) variance. Define the school-level ICC $\rho_S$ by 

$$\rho_S = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2} = \frac{\sigma_S^2}{\sigma_T^2} \qquad (10)$$

Similarly, define the classroom-level ICC $\rho_C$ by 

$$\rho_C = \frac{\sigma_C^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2} = \frac{\sigma_C^2}{\sigma_T^2} \qquad (11)$$

Together, the ICCs $\rho_S$ and $\rho_C$ define the clustering structure in the three-level experiment. 
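
As a small illustration of these definitions, the sketch below converts three variance components into the two ICCs. The function name and the variance components are hypothetical and are not estimates from any study.

```python
def three_level_iccs(var_school, var_class, var_within):
    """Return (school-level ICC, classroom-level ICC) from variance components."""
    if min(var_school, var_class, var_within) < 0:
        raise ValueError("variance components cannot be negative")
    total = var_school + var_class + var_within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_school / total, var_class / total


# Hypothetical variance components on a test-score scale.
rho_s, rho_c = three_level_iccs(var_school=20.0, var_class=10.0, var_within=70.0)
print(rho_s, rho_c)
```

Because both ICCs share the total variance as their denominator, their sum can never exceed one.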



Three-Level Hierarchical Designs With No Covariates 

Using the same ideas as for two-level hierarchical designs, one can compute power for three-level 
hierarchical designs by using the appropriate operational sample size and operational effect 
size. 

Because the actual number of clusters assigned to each of the two treatments is m, we enter the 
power table with operational total sample size $N^T = 2m$. The operational effect size is 

$$\Delta^T = \delta \sqrt{\frac{pn}{1 + (pn-1)\rho_S + (n-1)\rho_C}} \qquad (12)$$

where $\delta$ is the effect size, $\rho_S$ is the cluster-level (school) ICC, $\rho_C$ is the subcluster-level 
(classroom) ICC, p is the number of subclusters (classrooms) per cluster (school), and n is the 
number of individuals in each subcluster (classroom). Note that if 

$$\frac{pn-1}{n-1} > \frac{\rho_C}{1-\rho_S},$$

the design effect is greater than one. Because it will generally be the case that $\rho_C < 0.5$ and 
$\rho_S < 0.5$, the design effect will usually be greater than one, so the operational effect size $\Delta^T$ is 
usually larger than the actual effect size $\delta$. However, the operational total sample size $N^T$ is 
smaller than the actual total sample size 2mpn, so the power in the design with clustering will be 
smaller than in the design without clustering. 

Using the operational effect size makes it possible to compute statistical power and sample size 
requirements for analyses based on clustered samples using tables and computer programs 
designed for the independent groups t test. For example, one can read the power value by 
entering Table 1 on the row given by the operational sample size $N^T$ and finding the column 
corresponding to the operational effect size $\Delta^T$. 

Computing Statistical Power for Noncentral t-Distributions Using Computer Programs 

An alternative method for computing statistical power in three-level hierarchical designs is to use 
a computer program that has a built-in function that computes the noncentral t-distribution. To 
use such a function to compute the statistical power of a test, one must provide the program with 
a value of the noncentrality parameter, specifically, 

$$\lambda = \delta \sqrt{\frac{mpn}{2[1 + (pn-1)\rho_S + (n-1)\rho_C]}} \qquad (13)$$

Power is computed using equation (6) for a one-tailed test or (7) for a two-tailed test, as above. 
Note that the degrees of freedom used is the same as in the two-level hierarchical design; that is, 
the degrees of freedom do not depend on the number of subclusters sampled (Konstantopoulos 
2008b). 

The Impact of Design Parameters on Power for Three-Level Hierarchical Designs With No 
Covariates 
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This formulation helps make clear the effects of the number of clusters m, the number of 
subclusters p, within-subcluster sample size n, ICCs ps and pc, and effect size S on statistical 
power. As in the case of two-level designs, for any fixed operational effect size, power increases 
rapidly as m increases, tending to 1.00 as in becomes large. Similarly, for any fixed operational 
sample size, power increases rapidly as d increases, tending to 1 .00 as d becomes large. 

The impact of the number of subclusters per cluster p, the within-subcluster sample size n, and
the ICCs ρ_S and ρ_C occurs entirely through the design effect. The impact of ρ_S and ρ_C on the
design effect is the easiest to see. If ρ_S = ρ_C = 0, the design effect is √(pn), which is the maximum
value of the design effect and therefore corresponds to the maximum operational effect size and
the maximum power that can be attained in this design. However, as either ρ_S or ρ_C increases, the
design effect (and therefore power) decreases. Furthermore, an increase in ρ_S will have a more
deleterious effect on power than will a similarly sized increase in ρ_C. For example, when n = 20
and p = 3, the design effect is 7.75 when ρ_S = ρ_C = 0; 4.55 when ρ_S = 0 and ρ_C = 0.1; 2.99 when
ρ_S = 0 and ρ_C = 0.3; and 1.73 when ρ_S = 0 and ρ_C = 1. Similarly, when ρ_C = 0 and ρ_S = 0.1, the
design effect decreases from 7.75 to 2.95; when ρ_C = 0 and ρ_S = 0.3, the design effect is 1.79;
and when ρ_C = 0 and ρ_S = 1, the design effect is, of course, one.
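The design-effect values quoted in this paragraph can be checked numerically. The following is a minimal sketch, not part of the original report: the function name `design_effect_3level` is a hypothetical helper, and it simply evaluates the square root of pn / [1 + (pn − 1)ρ_S + (n − 1)ρ_C], where ρ_S and ρ_C are the cluster-level and subcluster-level ICCs.

```python
import math

def design_effect_3level(n, p, rho_s, rho_c):
    """Design effect for a three-level hierarchical design with no covariates.

    n: within-subcluster sample size, p: subclusters per cluster,
    rho_s: cluster-level ICC, rho_c: subcluster-level ICC.
    (Hypothetical helper name; the formula is the one discussed in the text.)
    """
    return math.sqrt(p * n / (1 + (p * n - 1) * rho_s + (n - 1) * rho_c))

# Walk through the n = 20, p = 3 illustration, one (rho_s, rho_c) pair at a time.
for rho_s, rho_c in [(0, 0), (0, 0.1), (0, 0.3), (0, 1), (0.1, 0), (0.3, 0), (1, 0)]:
    value = design_effect_3level(20, 3, rho_s, rho_c)
    print(f"rho_s={rho_s}, rho_c={rho_c}: design effect = {value:.2f}")
```

Comparing the rows with equal ICC values at the two levels shows the asymmetry described above: the cluster-level ICC is multiplied by (pn − 1), the subcluster-level ICC only by (n − 1).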

To see the impact of the within-subcluster sample size n on the design effect, it is useful to
rewrite the design effect as

$$
\sqrt{\frac{pn}{1 + (pn - 1)\rho_S + (n - 1)\rho_C}}
= \sqrt{\frac{1}{\frac{1}{pn} + \left(1 - \frac{1}{pn}\right)\rho_S + \left(\frac{1}{p} - \frac{1}{pn}\right)\rho_C}} .
$$

This formulation makes clear that as n increases, the denominator of the design effect (as
expressed on the right, above) becomes smaller, so the design effect (and therefore the
operational effect size) becomes larger, but only to a point. No matter how large n becomes, the
design effect can never become larger than

$$
\sqrt{\frac{1}{\rho_S + \frac{1}{p}\rho_C}} .
$$

Moreover, the design effect approaches this limiting value rather quickly. For example, if ρ_S =
0.20, ρ_C = 0.10, and p = 2, the largest the design effect can be is √(1/[0.20 + (0.10/2)]) = 2.00, but
when n = 10, the design effect is already 1.87, and doubling n to n = 20 increases the design
effect to only 1.93. Any further increases in n can have only a very modest impact on the design
effect and therefore on power. This explication demonstrates that beyond a point (which occurs
when n is rather modest), obtaining a larger sample size by increasing n has little impact on
power.
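How quickly the design effect approaches its ceiling can be tabulated directly. The sketch below is illustrative only; both function names are assumptions introduced here, and the inputs are the values used in the paragraph above (cluster-level ICC 0.20, subcluster-level ICC 0.10, p = 2).

```python
import math

def design_effect_3level(n, p, rho_s, rho_c):
    """Design effect for the three-level hierarchical design (no covariates)."""
    return math.sqrt(p * n / (1 + (p * n - 1) * rho_s + (n - 1) * rho_c))

def limiting_design_effect(p, rho_s, rho_c):
    """Upper bound on the design effect as n grows without limit."""
    return math.sqrt(1 / (rho_s + rho_c / p))

limit = limiting_design_effect(2, 0.20, 0.10)
print(f"limit = {limit:.2f}")
for n in (5, 10, 20, 40, 80):
    de = design_effect_3level(n, 2, 0.20, 0.10)
    # Share of the attainable maximum already reached at this n.
    print(f"n = {n:3d}: design effect = {de:.2f} ({de / limit:.0%} of the limit)")
```

The printed shares make the diminishing returns concrete: most of the attainable design effect is already in hand at modest n.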

Example. Return to the design considered earlier for a study to evaluate the effects of a
second-grade supplemental reading intervention. Remember that the assumptions made here are
intended for illustration only and are not intended to suggest values that will always be
appropriate for other research studies. Continue to assume that a school-level ICC of about
ρ_S = 0.20 is plausible and assume that a classroom-level ICC of roughly ρ_C = 0.13 is plausible.
Assume, in addition, that n = 10 students from each of p = 2 classrooms would participate from each
school. Previous studies of this intervention suggest that the effect size is likely to be δ = 0.35.
The initial plan was for a study that would assign m = 30 schools to each condition. To determine
the statistical power of this design, first compute the operational effect size using equation (12):



$$
\Delta^T = (0.35)\sqrt{\frac{2(10)}{1 + [2(10) - 1](0.20) + (10 - 1)(0.13)}} = (0.35)(1.830) = 0.641 .
$$

Entering Table 1 on the row corresponding to N^T = 2m = 60, one sees that the power for Δ^T =
0.64 will be between 0.63 (the power for Δ^T = 0.60) and 0.76 (the power for Δ^T = 0.70).
Interpolating between these two values (0.64 is four-tenths of the way from 0.6 to 0.7, so one goes
four-tenths of the way from 0.63 to 0.76) yields a power level of 0.68.

Because a power level of 0.68 is lower than desired, one might consider altering the design to 
increase power. Similar to the case of the two-level hierarchical design, neither increasing the 
number of students, n, per classroom, nor increasing the number of classrooms, p, per school 
(even if feasible) would have a substantial effect on statistical power. 

Increasing the number of schools, however, has a much more dramatic effect on power. 
Increasing the number of schools by 50% to m = 45 per treatment group would have no effect on
the operational effect size, but power values from the row of the table for N^T = 2m = 90 for Δ^T =
0.60 and 0.70 show that the power is between 0.80 and 0.91. Interpolating four-tenths of the way
from 0.80 to 0.91, one gets a power of approximately 0.84 for Δ^T = 0.64. Using covariates also
can increase power; this method is discussed in the next section.
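The arithmetic of this example (operational effect size, then linear interpolation between two adjacent table columns) can be sketched as follows. The function names are hypothetical, and the four table entries are the ones quoted in the example rather than values computed here.

```python
import math

def operational_effect_size(delta, n, p, rho_s, rho_c):
    """Effect size times the three-level hierarchical design effect (no covariates)."""
    return delta * math.sqrt(p * n / (1 + (p * n - 1) * rho_s + (n - 1) * rho_c))

def interpolate_power(es, es_lo, power_lo, es_hi, power_hi):
    """Linear interpolation between two adjacent columns of a power table."""
    frac = (es - es_lo) / (es_hi - es_lo)
    return power_lo + frac * (power_hi - power_lo)

es = operational_effect_size(0.35, n=10, p=2, rho_s=0.20, rho_c=0.13)
es_rounded = round(es, 2)  # the table is entered with two decimals

# Table entries quoted in the text for operational effect sizes 0.60 and 0.70.
power_m30 = interpolate_power(es_rounded, 0.60, 0.63, 0.70, 0.76)  # row for 60 schools in total
power_m45 = interpolate_power(es_rounded, 0.60, 0.80, 0.70, 0.91)  # row for 90 schools in total

print(f"operational effect size = {es:.3f}")
print(f"interpolated power, m = 30 per group: {power_m30:.2f}")
print(f"interpolated power, m = 45 per group: {power_m45:.2f}")
```

Linear interpolation is only an approximation to the underlying power curve, so a direct noncentral t computation may differ in the second decimal.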

Three-Level Hierarchical Designs With Covariates 

Now suppose that there are q_S (0 ≤ q_S < 2m − 2) cluster-level covariates, q_C (0 ≤ q_C < 2mp − q_S −
2) subcluster-level covariates, and q_W (0 ≤ q_W < N − q_S − q_C − 2) individual-level covariates in the
analysis.² For example, a design with q_W = 1, q_C = 1, and q_S = 1 might arise if a pretest were used
(centered on subcluster means) as an individual-level covariate; subcluster means (centered on
cluster means) on the pretest were used as a subcluster-level covariate; and cluster means on the
pretest were used as a cluster-level covariate. The centering of covariates on higher-level means
is again crucial for the power computations described below to be exact.

When covariates are used in the design, both the operational effect size and the operational
sample size are slightly modified. As with the two-level design, the operational sample size
N_A^T = 2m − q_S is entered into Table 1. This decrease in the operational sample size relative to a
design without covariates reflects the degrees of freedom lost due to the modeling of
between-cluster covariates.

² Note that the possibility of having 0 (no) covariates at a given level has been included in the previous section, and
that adding covariates necessitates modifications of the operational sample size and the operational effect sizes.



The operational effect size increases to an extent that depends on how much the covariates
explain between-cluster, between-subcluster but within-cluster, and within-subcluster variance.
R_W² is the amount of within-subcluster variance the covariates explain, R_S² is the amount of
between-cluster (between-school) variance the covariates explain, and R_C² is the amount of
between-subcluster but within-cluster (between-classroom but within-school) variance the
covariates explain. One can think of R_W², R_S², and R_C² as proportions of variance accounted for
(squared multiple correlations) in the usual way. The covariate-adjusted operational effect size is

$$
\Delta_A^T = \delta \sqrt{\frac{2m}{2m - q_S}}
\sqrt{\frac{pn}{1 + (pn - 1)\rho_S + (n - 1)\rho_C - \left[R_W^2 + (pn R_S^2 - R_W^2)\rho_S + (n R_C^2 - R_W^2)\rho_C\right]}} . \tag{14}
$$




In complete analogy with the two-level hierarchical design, the covariate-adjusted design effect
implied by equation (14) consists of two distinct parts: a correction term depending on the
operational sample size and a second term that contains the information about the effects of
clustering and adjustment for covariates on the operational effect size. Because the first factor is
again generally quite close to one, the design effect will be quite close to the last factor in
equation (14).

Because the term in square brackets in the denominator of the (second term in the) design effect
is never less than zero and is generally positive, the covariate-adjusted operational effect size
Δ_A^T is generally larger than the unadjusted operational effect size Δ^T. In computing Δ_A^T, it may be
more convenient to break the denominator of the (second term in the) design effect into two
parts. One part,

$$
A = 1 + (pn - 1)\rho_S + (n - 1)\rho_C ,
$$

reflects the impact of clustering, and the second part,

$$
B = R_W^2 + (pn R_S^2 - R_W^2)\rho_S + (n R_C^2 - R_W^2)\rho_C ,
$$

reflects the adjustment for the effects of covariates, so that

$$
\Delta_A^T = \delta \sqrt{\frac{2m}{2m - q_S}} \sqrt{\frac{pn}{A - B}} . \tag{15}
$$

Entering Table 1 on the row given by the operational sample size N_A^T and finding the column
corresponding to the operational effect size Δ_A^T, one can read the power value.

To use the noncentral t-distribution function to compute the statistical power of a test, one must
provide the program with a value of the covariate-adjusted noncentrality parameter,
classes in each school, and previous studies of this intervention suggest that the effect size is
likely to be δ = 0.35. Suppose that pretreatment reading test scores are available and that
classroom-centered individual pretest scores, school-centered classroom mean pretest scores, and
school mean pretest scores will be used as covariates. Thus, q_W = q_C = q_S = 1. There is some
evidence that values of R_W² = 0.5, R_C² = 0.6, and R_S² = 0.8 are plausible (Hedges and Hedberg
2007). These values are again mainly intended for illustration and should not necessarily be
interpreted as reference values to use when planning other research studies.

The initial plan was for a study that would assign m = 30 schools to each condition. To determine 
the statistical power of this design, one first computes the operational effect size, using equation 
(15), with 

$$
A = 1 + [(2)(10) - 1](0.2) + (10 - 1)(0.13) = 5.970
$$

and

$$
B = 0.5 + [(2)(10)(0.8) - 0.5](0.2) + [(10)(0.6) - 0.5](0.13) = 4.315 ,
$$

so that

$$
\Delta_A^T = 0.35 \sqrt{\frac{60}{59}} \sqrt{\frac{20}{5.970 - 4.315}} = (0.35)(1.01)(3.476) = 1.227 .
$$

Entering Table 1 on the row corresponding to N_A^T = 2m − q_S = 59 ≈ 60, one sees that the power for
Δ_A^T = 1.2 is listed as 1.00, so the power for Δ_A^T = 1.23 is at least 0.995.

One might regard this high statistical power as providing a margin of safety in case any 
assumptions are somewhat optimistic. Alternatively, because power of 0.995 may be higher than 
necessary, one might consider altering the design to decrease costs while maintaining acceptable 
statistical power. For example, one might consider decreasing the number of schools to m = 15
per treatment group. That decrease would have very little effect on the operational effect size (it
would change to 1.238), but reading power values from the row of Table 1 for N_A^T = 2m − q_S ≈ 30
for Δ_A^T = 1.2, one sees that the power is at least 0.89.

Comparing this value with that derived in the last section for the same study without covariates,
we see that a design with m = 15 schools per treatment group, using a pretest as a covariate, has
higher power than a design with m = 45 schools per treatment group (three times as many schools),
using no covariates (again, assuming the R² values used are accurate).
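The covariate-adjusted computation in this example splits naturally into the clustering part A, the covariate part B, and the small-sample correction factor. The sketch below follows that decomposition; the function name and argument names are assumptions made for illustration, and the inputs are the illustrative values of the example (ICCs 0.20 and 0.13, R² values 0.5, 0.6, and 0.8, one cluster-level covariate).

```python
import math

def adjusted_effect_size_3level(delta, m, n, p, rho_s, rho_c,
                                r2_w, r2_c, r2_s, q_s):
    """Return (A, B, adjusted operational effect size) for a three-level hierarchical design.

    m is the number of clusters per treatment group; q_s is the number of
    cluster-level covariates. Hypothetical helper, following the A/B split in the text.
    """
    a = 1 + (p * n - 1) * rho_s + (n - 1) * rho_c                           # clustering
    b = r2_w + (p * n * r2_s - r2_w) * rho_s + (n * r2_c - r2_w) * rho_c    # covariates
    correction = math.sqrt(2 * m / (2 * m - q_s))                           # close to one
    return a, b, delta * correction * math.sqrt(p * n / (a - b))

common = dict(delta=0.35, n=10, p=2, rho_s=0.20, rho_c=0.13,
              r2_w=0.5, r2_c=0.6, r2_s=0.8, q_s=1)
a, b, es_m30 = adjusted_effect_size_3level(m=30, **common)
_, _, es_m15 = adjusted_effect_size_3level(m=15, **common)

print(f"A = {a:.3f}, B = {b:.3f}")
print(f"adjusted operational effect size, m = 30: {es_m30:.3f}")
print(f"adjusted operational effect size, m = 15: {es_m15:.3f}")
```

Halving the number of schools changes only the correction factor, which is why the operational effect size barely moves; the loss of power comes from the smaller operational sample size.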
Chapter 8: Computing Power in Randomized-Block Designs



Computing Power in Two-Level Randomized-Block Designs 
That Assign Treatments Within Clusters 

Consider a two-level experiment that uses a total of m clusters (typically schools), with 2n
individuals in each cluster. Unlike the hierarchical design that assigns whole clusters (e.g.,
schools) to treatments, the experiment assigns some individuals within each cluster to each of
two treatments. That is, within each of the m schools, n individuals are assigned to each
treatment, so that mn individuals are assigned to each treatment and the total sample size is
N = 2mn.

Suppose that the intervention effect at the population level is (μ₁ − μ₂) in the units in which the
outcome is measured (e.g., test score scale points). Rather than this unstandardized effect,
statistical power is computed on the basis of the effect size or standardized intervention effect
(sometimes called Cohen’s d):

$$
\delta = \frac{\mu_1 - \mu_2}{\sigma} ,
$$

where σ is the total population standard deviation of the outcome within treatment groups. That
is, σ² = σ_S² + σ_W².

In the randomized-block design, as in the hierarchical design, power depends on the cluster-level
intraclass correlation coefficient (ICC), ρ = σ_S²/(σ_S² + σ_W²). However, in this design, power also
depends on the degree to which treatment effects vary across clusters. It is convenient to
characterize this treatment effect heterogeneity via the parameter ω, which represents the
proportion of between-cluster variability that is attributable to heterogeneity of treatment effects.
Thus, ω can be characterized as

$$
\omega = \sigma_{T \times S}^2 / \sigma_S^2 , \tag{17}
$$

where σ_T×S² is the variance due to the treatment-by-cluster interaction and σ_S² is the total
cluster-level variance. When the usual statistical model is used for power analysis, under very mild
additional assumptions σ_T×S² is less than σ_S², so that ω is almost always less than one and can be
as small as zero in cases where the treatment effect is very similar across clusters. For example, a
class size experiment involved kindergarten through fourth-grade students in 79 elementary
schools in 42 school districts in Tennessee. The experiment had a randomized-block design,
assigning treatments to classrooms within schools. Nye, Hedges, and Konstantopoulos (2000)
estimated the between-school variance of small-class effects and found that variation of
treatment effects was reasonably small for most grades and subject matter (with an average ω
value of about 0.3). However, it is possible for ω to be larger, particularly when treatment
implementation may vary widely.
Two-Level Randomized-Block Designs With No Covariates

If the experiment has m clusters, enter the one-sample t test power table with operational sample
size N^T = m. The operational effect size is

$$
\Delta^T = \delta \sqrt{\frac{n}{2[1 + (n\omega - 1)\rho]}} , \tag{18}
$$

where δ is the effect size, ρ is the ICC, ω is one half of the ratio of variance of the treatment
effects across clusters to the total cluster level variance, and n is the number of individuals
assigned to each treatment in each cluster. Note that the design effect

$$
\sqrt{\frac{n}{2[1 + (n\omega - 1)\rho]}}
$$

can be larger than one, so the operational effect size Δ^T can be larger than the actual effect size δ.
However, the operational sample size N^T = m is smaller than the actual within-treatment sample
size mn, so the power in the design with clustering will usually not be larger than in the design
without clustering.

Comparing the operational effect size in the randomized-block design with that in the two-level
hierarchical designs having the same total sample size, and noting that ω is usually smaller than
one, one can see that the operational effect size Δ^T is usually larger in the randomized-block
design.

Using the operational effect size makes it possible to compute statistical power and sample size 
requirements for analyses based on clustered samples using tables and computer programs 
designed for the one-sample t test. For example, entering Table 2 on the row given by the
operational sample size N^T, and finding the column corresponding to the operational effect size
Δ^T, one can read the power value.

Using a Computer Program to Compute Statistical Power in Two-Level Randomized-Block 
Designs With No Covariates. 

An alternative method for computing statistical power in clustered designs is to use a computer
program that has a built-in function that computes the noncentral t-distribution. To use such a
function to compute the statistical power of a test, one must provide the program with a value of
an index λ (related to the operational effect size) called the noncentrality parameter, specifically

$$
\lambda = \sqrt{m}\,\Delta^T = \frac{\delta \sqrt{mn/2}}{\sqrt{1 + (n\omega - 1)\rho}} . \tag{19}
$$

If H(x, ν, λ) is the cumulative distribution function of the noncentral t-distribution with ν degrees
of freedom and noncentrality parameter λ, then the power of the one-tailed test for treatment
effects at level α is

$$
p_1 = 1 - H[c(\alpha, m - 1), (m - 1), \lambda] , \tag{20}
$$

where c(α, ν) is the level α one-tailed critical value of the t-distribution with ν degrees of
freedom [e.g., c(0.05, 10) = 1.81]. The power of the two-tailed test at level α is

$$
p_2 = 1 - H[c(\alpha/2, m - 1), (m - 1), \lambda] + H[-c(\alpha/2, m - 1), (m - 1), \lambda] . \tag{21}
$$

These results are given in a slightly different notation in Raudenbush and Liu (2000). 
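For readers without access to a package that provides the noncentral t-distribution, the two power formulas can be approximated with the Python standard library alone. The sketch below is an assumption-laden illustration, not the method used in the report: `nct_cdf` approximates H by midpoint integration over the scaled chi distribution, `t_critical` finds the critical value by bisection, and all function names are hypothetical. The noncentrality used in the usage lines is the standard one-sample form, the square root of the operational sample size times the operational effect size.

```python
import math

def _norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def nct_cdf(x, df, nc=0.0, steps=2000):
    """Approximate H(x, df, nc), the noncentral t CDF, by midpoint integration.

    Writes T = (Z + nc) / S with S = sqrt(chi2_df / df), so that
    P(T <= x) = E[Phi(x * S - nc)], and integrates over the density of S.
    """
    upper = 1.0 + 12.0 / math.sqrt(2.0 * df)  # S has negligible mass beyond this point
    h = upper / steps
    log_const = math.log(2.0) + 0.5 * df * math.log(0.5 * df) - math.lgamma(0.5 * df)
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        log_density = log_const + (df - 1.0) * math.log(s) - 0.5 * df * s * s
        total += math.exp(log_density) * _norm_cdf(x * s - nc)
    return total * h

def t_critical(alpha, df):
    """One-tailed critical value c(alpha, df) of the central t, found by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if nct_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def power_one_tailed(lam, df, alpha=0.05):
    """Power of the one-tailed test, as in equation (20)."""
    return 1.0 - nct_cdf(t_critical(alpha, df), df, lam)

def power_two_tailed(lam, df, alpha=0.05):
    """Power of the two-tailed test, as in equation (21)."""
    c = t_critical(alpha / 2.0, df)
    return 1.0 - nct_cdf(c, df, lam) + nct_cdf(-c, df, lam)

# Usage with illustrative values: m = 30 clusters, operational effect size 0.50.
print(f"c(0.05, 10) = {t_critical(0.05, 10):.2f}")
lam = math.sqrt(30) * 0.50
print(f"two-tailed power = {power_two_tailed(lam, df=29):.2f}")
print(f"one-tailed power = {power_one_tailed(lam, df=29):.2f}")
```

In practice a statistical package's built-in noncentral t function is preferable; the numerical integration here is only meant to make the structure of equations (20) and (21) explicit.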

The Impact of Design Parameters on Power in Two-Level Randomized-Block Designs With
No Covariates.

This formulation helps to clarify the effects of within-cluster sample size 2n, number of clusters
m, ICC ρ, the heterogeneity parameter ω, and effect size δ on statistical power. As Table 2
shows, for any fixed operational effect size, power increases rapidly as m increases, tending to
1.00 as m becomes large. Similarly, for any fixed operational sample size, power increases
rapidly as δ increases and tends to 1.00 as δ becomes large. These are basic facts from the power
analysis of simple designs.

The within-cluster sample size n, the ICC ρ, and the heterogeneity parameter ω impact power
because they impact the design effect. The effect of the heterogeneity parameter can be
profound. If treatment effects are perfectly consistent across clusters so that ω = 0, the design
effect is √(n/[2(1 − ρ)]), which is the maximum value of the design effect for fixed values of n
and ρ and therefore corresponds to the maximum operational effect size and the maximum
power that can be attained in this design. However, as the heterogeneity of treatment effects
increases, the design effect (and therefore power) decreases.

To see the impact of the within-cluster sample size n on the design effect, it is useful to rewrite
the design effect as

$$
\sqrt{\frac{n}{2[1 + (n\omega - 1)\rho]}}
= \sqrt{\frac{1}{2\left[\frac{1}{n} + \left(\omega - \frac{1}{n}\right)\rho\right]}} .
$$

This formulation makes clear that as n increases, the denominator of the design effect becomes
smaller, so the design effect (and therefore the operational effect size) becomes larger, but only
to a point. No matter how large n becomes, the design effect can never become larger
than √(1/[2ωρ]). Moreover, the design effect approaches this limiting value rather quickly. For
example, if ρ = 0.20 and ω = 0.5, the largest the design effect can be is √(1/[2(0.20)(0.5)]) = 2.24,
but when n = 10, the design effect is already 1.67 and doubling n to n = 20 increases the design
effect to only 1.89. Any further increases in n can have only a very modest impact on the design
effect and therefore on power, demonstrating that, beyond a point (which occurs when n is rather
modest), obtaining a larger sample size by increasing n has little impact on power.
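The ceiling on the randomized-block design effect can be checked the same way as for the hierarchical design. This is a minimal sketch with hypothetical function names; `rho` is the ICC, `omega` is the heterogeneity parameter, and n is the number of individuals per treatment within each cluster.

```python
import math

def design_effect_rb(n, rho, omega):
    """Design effect for the two-level randomized-block design (no covariates)."""
    return math.sqrt(n / (2 * (1 + (n * omega - 1) * rho)))

def limiting_design_effect_rb(rho, omega):
    """Upper bound as n grows without limit; requires omega * rho > 0."""
    return math.sqrt(1 / (2 * omega * rho))

limit = limiting_design_effect_rb(0.20, 0.5)
print(f"limit = {limit:.2f}")
for n in (5, 10, 20, 40, 80):
    print(f"n = {n:3d}: design effect = {design_effect_rb(n, 0.20, 0.5):.2f}")

# With perfectly consistent treatment effects (omega = 0) there is no finite
# ceiling: the design effect sqrt(n / [2(1 - rho)]) keeps growing with n.
print(f"omega = 0, n = 10: {design_effect_rb(10, 0.20, 0.0):.2f}")
```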

Example. Return to the problem of designing a study to evaluate the effects of a second-grade 
supplemental reading intervention. Suppose that it is reasonable to believe that this supplemental 
reading intervention might be administered to some students and not to others in the same school 
without fear of contamination. Thus, a two-level randomized-block design is used. It is still the 
intention to recruit from a broad range of schools so that a school-level ICC of about ρ = 0.20 is
plausible (e.g., Hedges and Hedberg 2007). One expects that at least n = 10 students will
participate in each treatment group from each school, and previous studies of this intervention
suggest that the effect size is likely to be δ = 0.35 and that effects are likely to be fairly consistent
across schools, so that a value of ω = 0.5 is plausible (even fairly conservative). The initial plan
was for a study that would involve m = 30 schools. Recall that these assumptions are intended 
for illustration only and are not intended to suggest values that will always be appropriate for 
other research studies. 

To determine the statistical power of this design, one first computes the operational effect size, 
using equation (18): 



$$
\Delta^T = (0.35)\sqrt{\frac{10}{2\{1 + [(10)(0.5) - 1]0.20\}}} = (0.35)(1.667) = 0.583 .
$$

Entering Table 2 on the row corresponding to N^T = m = 30, one sees that the power for Δ^T = 0.58
will be between 0.75 (the power for Δ^T = 0.50) and 0.89 (the power for Δ^T = 0.60). Interpolating
between these two values (0.58 is eight-tenths of the way from 0.5 to 0.6, so one needs to go
eight-tenths of the way between 0.75 and 0.89), one obtains a power level of 0.86.

Note that the power of a two-level hierarchical design that assigned the same number of students
mn = (30)(10) = 300 to each treatment group (but used twice as many schools because there were
half as many individuals per school) was considerably lower (only 0.71). This example illustrates
how much higher the power of a randomized-block design may be if the design can be used.

Note that this power calculation is somewhat sensitive to the value of the parameter ω that
describes the heterogeneity of treatment effects across clusters. If ω had been twice as large (that
is, ω = 1.0), the operational effect size would have been only Δ^T = 0.47, and the power would
have been only approximately 0.69.
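The sensitivity to the heterogeneity parameter can be explored by sweeping it over a range while holding the other example values fixed. The sketch below is illustrative; the function name is an assumption, and the noncentrality shown is the standard one-sample form (square root of the number of clusters times the operational effect size).

```python
import math

def operational_effect_size_rb(delta, n, rho, omega):
    """Operational effect size for the two-level randomized-block design (no covariates)."""
    return delta * math.sqrt(n / (2 * (1 + (n * omega - 1) * rho)))

m, delta, n, rho = 30, 0.35, 10, 0.20   # values from the example
results = {}
for omega in (0.0, 0.25, 0.5, 0.75, 1.0):
    es = operational_effect_size_rb(delta, n, rho, omega)
    results[omega] = es
    # One-sample noncentrality: sqrt(operational sample size) * operational effect size.
    print(f"omega = {omega:.2f}: operational effect size = {es:.3f}, "
          f"noncentrality = {math.sqrt(m) * es:.2f}")
```

Because the heterogeneity parameter is multiplied by n inside the design effect, a misjudged value matters more as the within-cluster sample size grows.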
Two-Level Randomized-Block Designs With Covariates

Now suppose that there are q_S (0 ≤ q_S < m − 1) cluster-level covariates and q_W (0 ≤ q_W < N − q_S −
1) individual-level covariates in the analysis.³ For example, a design with q_W = 1 and q_S = 1
might arise if a pretest were used as an individual-level covariate (centered on cluster means) and
cluster means on the covariate were used as a group-level covariate.

When covariates are used in the design, both the operational effect size and the operational
sample size are slightly modified. As was the case in hierarchical designs, the operational sample
size is decreased relative to a design without covariates to reflect the degrees of freedom lost due
to the modeling of between-cluster covariates. In the two-level randomized-block design, one
enters Table 2 with operational sample size N_A^T = m − q_S.

As was the case in hierarchical designs, the operational effect size is increased when covariates
are used, but the nature of this modification is somewhat different. In the randomized-block
design, cluster-level covariates increase power primarily if they explain part of the variance in
treatment effects across clusters (that is, if they explain part of the cluster by treatment
interaction variance). Contrary to the hierarchical design, cluster-level covariates that explain
only variation among cluster means will have no effect on power in the randomized-block
design. R_W² is the amount of within-cluster variance the covariates explain, and R_TS² is the
amount of between-cluster variance in treatment effects the covariates explain. One can think of
R_W² and R_TS² as proportions of variance accounted for (squared correlations) in the usual way.

The covariate-adjusted operational effect size is

$$
\Delta_A^T = \delta \sqrt{\frac{m}{m - q_S}}
\sqrt{\frac{n/2}{1 + (n\omega - 1)\rho - \left[R_W^2 + (n\omega R_{TS}^2 - R_W^2)\rho\right]}} . \tag{22}
$$



Note that the covariate-adjusted design effect implied by equation (22) consists of two distinct
parts. The first part, √(m/(m − q_S)), is a correction term that depends on the sample size (number
of clusters m) and the number of cluster-level covariates q_S. This term is necessary because the
degrees of freedom used in the t test depend on the number of cluster-level covariates modeled,
but the noncentrality parameter λ does not depend on the number of covariates modeled. Note
that the value of this factor is usually quite close to one. For instance, in an experiment with m =
40 total clusters and q_S = 1 cluster-level covariate used, √(m/(m − q_S)) = 1.013, so this factor
differs from one by only about 1%.

The second part of the design effect,

$$
\sqrt{\frac{n/2}{1 + (n\omega - 1)\rho - \left[R_W^2 + (n\omega R_{TS}^2 - R_W^2)\rho\right]}} ,
$$

is of a similar form to the unadjusted design effect. Because √(m/(m − q_S)) is so often virtually
one, the design effect will typically be quite close to the last factor in equation (22).

³ Note that the possibility of having 0 (no) covariates at a given level has been included in the previous section, and
that adding covariates necessitates modifications of the operational sample size and the operational effect sizes.

The covariate-adjusted operational effect size is generally larger than the unadjusted operational
effect size Δ^T because R_TS² and R_W² are generally positive; hence, the term in square brackets in
the denominator of the second term of the design effect is usually positive, and subtracting it
makes the denominator smaller.

Using the operational effect size makes it possible to compute statistical power and sample size
requirements for analyses based on clustered samples using the tables and computer programs
designed for the one-sample t test. For example, entering Table 2 on the row given by the
operational sample size N_A^T, and finding the column corresponding to the operational effect size
Δ_A^T, one can read the power value.

Using a Computer Program to Compute Statistical Power in Two-Level Randomized-Block 
Designs With Covariates. 

An alternative method for computing statistical power in clustered designs is to use a computer
program that has a built-in function that computes the noncentral t-distribution. To use such a
function to compute the statistical power of a test, one must provide the program with a value of
an index λ (associated with the operational effect size) called the noncentrality parameter,
specifically,

$$
\lambda_A = \sqrt{m - q_S}\,\Delta_A^T
= \frac{\delta \sqrt{mn/2}}{\sqrt{1 + (n\omega - 1)\rho - \left[R_W^2 + (n\omega R_{TS}^2 - R_W^2)\rho\right]}} . \tag{23}
$$



Power is computed using equation (20) for a one-tailed test or (21) for a two-tailed test, as above,
except that the covariate-adjusted noncentrality parameter (23) is used with m − 1 − q_S degrees of
freedom (Raudenbush and Liu 2000).

Example. Returning to the problem of designing a study to evaluate the effects of a second-grade 
supplemental reading intervention, suppose that it is reasonable that this intervention might be 
administered to some students and not to others in the same school, without fear of 
contamination. In this case, a two-level randomized-block design would be used. One would still 
intend to recruit from a broad range of schools so that a school-level ICC of about ρ = 0.20 is
plausible (Hedges and Hedberg 2007). Again, suppose that at least n = 10 students would
participate in each treatment group from each school. Previous studies of this intervention
suggest that the effect size is likely to be δ = 0.35 and that effects are likely to be fairly consistent
across schools, so that a value of ω = 0.5 is plausible (even fairly conservative). We continue to
assume that the values R_W² = 0.5 and R_S² = 0.8 are plausible. We also assume that about half of
the total variance between schools that the covariates explain is due to the ability of the
covariates to predict treatment effect variability, so that R_TS² = 0.4. We again clarify that these
assumptions are intended for illustration only and are not intended to suggest values that will
always be appropriate for other research studies. The initial plan was for a study that would
involve m = 30 schools. To determine the statistical power of this design, one first computes the
operational effect size, using equation (22):

$$
\Delta_A^T = (0.35)\sqrt{\frac{30}{29}}
\sqrt{\frac{10/2}{1 + [(10)(0.5) - 1](0.2) - \{0.5 + [(10)(0.5)(0.4) - 0.5](0.2)\}}}
= (0.35)(1.017)(2.236) = 0.80 .
$$

Entering Table 2 on the row corresponding to N_A^T = m − q_S = 29, one sees that the power for Δ_A^T =
0.80 will be indistinguishable from 0.99.
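The covariate-adjusted computation for the randomized-block example, together with the corresponding noncentrality parameter, can be sketched as follows. The function and argument names are assumptions made for illustration; the inputs are the example's values (ICC 0.20, heterogeneity parameter 0.5, within-cluster R² of 0.5, treatment-by-school R² of 0.4, one cluster-level covariate).

```python
import math

def adjusted_effect_size_rb(delta, m, n, rho, omega, r2_w, r2_ts, q_s):
    """Covariate-adjusted operational effect size, two-level randomized-block design."""
    clustering = 1 + (n * omega - 1) * rho
    covariates = r2_w + (n * omega * r2_ts - r2_w) * rho
    correction = math.sqrt(m / (m - q_s))          # usually very close to one
    return delta * correction * math.sqrt((n / 2) / (clustering - covariates))

m, q_s = 30, 1
es = adjusted_effect_size_rb(0.35, m=m, n=10, rho=0.20, omega=0.5,
                             r2_w=0.5, r2_ts=0.4, q_s=q_s)

# Noncentrality: sqrt(operational sample size) times the adjusted effect size.
lam = math.sqrt(m - q_s) * es
df = m - 1 - q_s                                    # degrees of freedom for the t test

print(f"adjusted operational effect size = {es:.2f}")
print(f"noncentrality = {lam:.3f} with {df} degrees of freedom")
```

The correction factor cancels in the noncentrality parameter, so the number of cluster-level covariates affects power only through the degrees of freedom and through the variance the covariates explain.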

Computing Statistical Power in Three-Level Randomized-Block Designs 

In three-level designs, the sampling involves clusters (such as schools) and subclusters (such as 
classrooms). In three-level randomized-block designs, both treatments are given to some 
individuals within every school. There are two variations of the three-level randomized-block 
experiment. One variant assigns subclusters (e.g., classrooms) to treatments, so that every 
individual within the same subcluster (classroom) receives the same treatment, but different 
subclusters within the same cluster (different classrooms within the same school) receive 
different treatments. The other variant assigns individuals within subclusters (classrooms) to 
treatments, so that some individuals receive each treatment in every classroom. 

Suppose that the intervention effect at the population level is (μ₁ − μ₂) in the units in which the
outcome is measured (e.g., test score scale points). Statistical power again depends on the
intervention effect via the effect size or standardized intervention effect (sometimes called
Cohen’s d)

$$
\delta = \frac{\mu_1 - \mu_2}{\sigma} ,
$$

where σ is the total population standard deviation of the outcome within treatment groups. That
is,

$$
\sigma^2 = \sigma_S^2 + \sigma_C^2 + \sigma_W^2 .
$$

In three-level designs, recall that two indices, the school-level (cluster) ICC ρ_S (defined in
equation [10] above) and the classroom-level (subcluster) ICC ρ_C (defined in equation [11]
above), are necessary to characterize the intraclass correlation structure.

In two-level randomized-block designs, it was observed that power also depends on the degree to 
which treatment effects vary across clusters. This observation is also true in three-level 
randomized-block designs. In three-level randomized-block designs that assign treatments within 
subclusters (e.g., classrooms), power also depends on the degree to which treatment effects vary 
across subclusters. The treatment effect heterogeneity across clusters is characterized via the
ratio, ω_S, of variation of the treatment effects across clusters to the variation of untreated cluster
means. Thus, ω_S can be characterized via

$$
\omega_S = \sigma_{T \times S}^2 / \sigma_S^2 , \tag{24}
$$

where σ_T×S² is the treatment-by-cluster interaction variance and σ_S² is the variance of the cluster
means. Similarly, in the case of randomized-block designs that assign treatments within subclusters,
the treatment effect heterogeneity across subclusters is characterized via the ratio, ω_C, of variation
of the treatment effects across subclusters to the variation of untreated subcluster means. Thus, ω_C
can be characterized via

$$
\omega_C = \sigma_{T \times C}^2 / \sigma_C^2 , \tag{25}
$$

where σ_T×C² is the treatment-by-subcluster interaction variance and σ_C² is the variance of the
untreated subcluster means within clusters.

As in the case of two-level randomized-block designs, when the usual statistical model is used
for power analysis, under very mild additional assumptions σ_T×S² is less than σ_S², so that ω_S is
usually less than one and can be as small as zero in cases where the treatment effect is very
similar across clusters. In the same way, when the usual statistical model is used for power
analysis, under very mild additional assumptions σ_T×C² is less than σ_C², so that ω_C is usually less
than one and can be as small as zero in cases where the treatment effect is very similar across
subclusters.

Power computations for three-level randomized-block designs are quite similar to power 
computations for two-level randomized-block designs and two- and three-level hierarchical 
designs. In each case, power computation begins with computing a design effect and then using 
that design effect to compute the operational effect size. This operational effect size is then used 
along with the operational sample size to obtain the statistical power from tables of the power of 
the one-sample t test (such as Table 2). The computation of power for these designs is described 
in detail in Appendix C. 

Chapter 9: Conclusions



This paper has provided an introduction to statistical power analysis in complex designs 
involving two- or three-stage cluster sampling. It shows how to use the concepts of operational 
effect sizes and sample sizes to compute statistical power using the power tables constructed for 
simple sampling designs. This formulation also explains how the clustering structure described 
by intraclass correlation coefficients (ICCs) influences operational effect size and, therefore, 
statistical power. Additionally, this paper details how the use of covariates can increase power by 
increasing the operational effect size. 

Several general conclusions about design follow. The first is that in hierarchical designs, 
clustering will always decrease statistical power compared with a design that does not involve 
clustering. The difference in power will be determined by the design effect, which is a function 
of the experimental design and the ICC (e.g., of $\rho$). We note that the design effect we define
differs from the usual design effect described in Kish (1965) and elsewhere. For example, in a
two-level hierarchical design, the design effect would be $\sqrt{n}$ if there were no clustering, but it
will never be larger than $\sqrt{1/\rho}$ in a design with clustering, no matter how large the
within-cluster sample size (n) becomes. The design effect (and the power) approaches its maximum when
n is quite modest, so that increasing n beyond this point has little effect on power. In making
decisions about allocation of sample size, it is therefore better (in terms of statistical power) to 
have a larger number of clusters than a larger number of individuals within clusters. The use of 
covariates can increase power substantially and, therefore, is always desirable. 
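
The plateau in the design effect described above can be checked numerically. The following is a minimal sketch, assuming the two-level hierarchical design effect without covariates has the form $\sqrt{n/[1+(n-1)\rho]}$ listed in Appendix A; the function name and the value $\rho = 0.20$ are illustrative assumptions, not values taken from the paper's tables.

```python
import math

def design_effect_two_level(n: int, rho: float) -> float:
    """Two-level hierarchical design effect without covariates:
    sqrt(n / (1 + (n - 1) * rho))."""
    return math.sqrt(n / (1.0 + (n - 1) * rho))

rho = 0.20                       # illustrative ICC (assumption)
ceiling = math.sqrt(1.0 / rho)   # limit of the design effect as n grows

for n in (1, 5, 10, 20, 50, 100, 1000):
    de = design_effect_two_level(n, rho)
    # Report how close each within-cluster sample size gets to the ceiling.
    print(f"n={n:5d}  design effect={de:.3f}  share of ceiling={de / ceiling:.2f}")
```

Under these assumed values, most of the attainable design effect is already reached at modest n, which is the reason adding clusters tends to pay off more than adding individuals within clusters.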

Power for randomized-block designs may be computed in a fashion similar to hierarchical 
designs. When they are feasible, randomized-block designs will always have higher power than 
hierarchical designs for the same sample size. In randomized-block designs, the power depends 
on the heterogeneity of treatment effects across clusters (blocks), and this influence can be 
profound. Thus, the power advantage of randomized-block designs is most substantial when 
treatment effects are reasonably homogeneous across clusters. In fact, in randomized-block 
designs where treatment effects are homogeneous across clusters, power will generally be greater 
than the power in designs that do not involve clustering. 

Appendix A: Design Effects in Two- or Three-Level Hierarchical Designs With and Without Covariates

Two-level hierarchical design

No covariates:

\[ \sqrt{\frac{n}{1 + (n-1)\rho}} \]

With covariates:

\[ \sqrt{\frac{2m}{2m - q_S}} \sqrt{\frac{n}{1 + (n-1)\rho - \left[ R_W^2 + (nR_S^2 - R_W^2)\rho \right]}} \]

Three-level hierarchical design

No covariates:

\[ \sqrt{\frac{pn}{1 + (pn-1)\rho_S + (n-1)\rho_C}} \]

With covariates:

\[ \sqrt{\frac{2m}{2m - q_S}} \sqrt{\frac{pn}{1 + (pn-1)\rho_S + (n-1)\rho_C - \left[ R_W^2 + (pnR_S^2 - R_W^2)\rho_S + (nR_C^2 - R_W^2)\rho_C \right]}} \]

Note: See the text for definitions of symbols used in this table. They can be found on the following pages:

Design Effects in Two-Level Hierarchical Designs Without Covariates: Page 17
Design Effects in Two-Level Hierarchical Designs With Covariates: Pages 20-21 and Formula 8
Design Effects in Three-Level Hierarchical Designs Without Covariates: Pages 25-26 and Formula 12
Design Effects in Three-Level Hierarchical Designs With Covariates: Pages 28-29 and Formula 14

Appendix B: Design Effects in Two- or Three-Level Randomized-Block Designs With and Without
Covariates

Two-level randomized-block design

No covariates:

\[ \sqrt{\frac{n/2}{1 + (n\omega - 1)\rho}} \]

With covariates:

\[ \sqrt{\frac{m}{m - q_S}} \sqrt{\frac{n/2}{1 + (n\omega - 1)\rho - \left[ R_W^2 + (n\omega R_{TS}^2 - R_W^2)\rho \right]}} \]

Three-level randomized-block design assigning treatments to subclusters

No covariates:

\[ \sqrt{\frac{pn}{2\left[ 1 + (pn\omega_S - 1)\rho_S + (n-1)\rho_C \right]}} \]

With covariates:

\[ \sqrt{\frac{m}{m - q_S}} \sqrt{\frac{pn/2}{1 + (pn\omega_S - 1)\rho_S + (n-1)\rho_C - \left[ R_W^2 + (pn\omega_S R_{TS}^2 - R_W^2)\rho_S + (nR_C^2 - R_W^2)\rho_C \right]}} \]

Three-level randomized-block design assigning treatments to individuals

No covariates:

\[ \sqrt{\frac{pn}{2\left[ 1 + (pn\omega_S - 1)\rho_S + (n\omega_C - 1)\rho_C \right]}} \]

With covariates:

\[ \sqrt{\frac{m}{m - q_S}} \sqrt{\frac{pn/2}{1 + (pn\omega_S - 1)\rho_S + (n\omega_C - 1)\rho_C - \left[ R_W^2 + (pn\omega_S R_{TS}^2 - R_W^2)\rho_S + (n\omega_C R_{TC}^2 - R_W^2)\rho_C \right]}} \]

Note: See the text for definitions of symbols used in this table. They can be found on the following pages:

Design Effects in Two-Level Randomized-Block Designs Without Covariates: Pages 34-35 and Formula 18
Design Effects in Two-Level Randomized-Block Designs With Covariates: Pages 38-39 and Formula 22
Three-Level Randomized-Block Designs Assigning Treatments to Subclusters Without Covariates: Appendix C, page C-1, and Formula 26
Three-Level Randomized-Block Designs Assigning Treatments to Subclusters With Covariates: Appendix C, pages C-2 and C-3, and Formula 28
Three-Level Randomized-Block Designs Assigning Treatments to Individuals Without Covariates: Appendix C, page C-5, and Formula 31
Three-Level Randomized-Block Designs Assigning Treatments to Individuals With Covariates: Appendix C, page C-6, and Formulae 33 and 34

Appendix C: Computing Power in Three-Level Randomized-Block Designs

This appendix sketches the computation of statistical power in three-level randomized-block 
designs where the highest level units are considered clusters (random effects). Two variations 
are considered. One design assigns intact classrooms (subclusters) to treatments. The other 
variation assigns individuals within classrooms to treatments. In each case, designs with and 
without covariates are considered. However, the computations are quite similar in every case 
(see Konstantopoulos 2008a). 

Three-Level Randomized-Block Designs That Assign Treatment to 
Subclusters With No Covariates 

Suppose that one is planning a three-level randomized-block experiment that will use a total
of m clusters (typically schools), each with 2p subclusters, and will assign p of these
subclusters (classrooms) within each cluster (school) to a treatment condition and p of the
subclusters (classrooms) within each cluster (school) to a control condition. Assume that n
individuals are within each subcluster (classroom), so that mpn individuals are assigned to
each treatment and the total sample size is N = 2mpn.

If the treatment is assigned to subclusters within each of the m clusters, one enters the power
table with operational sample size $N_T = m$. The operational effect size is

\[ \Delta_T = \delta \sqrt{\frac{pn}{2\left[ 1 + (pn\omega_S - 1)\rho_S + (n-1)\rho_C \right]}} , \tag{26} \]

where $\delta$ is the effect size, $\rho_S$ is the school-level ICC, $\rho_C$ is the classroom-level ICC, $\omega_S$ is one
half of the ratio of variance of the treatment effects across clusters to the total cluster-level
variance, p is the number of subclusters (classrooms) assigned to each treatment within each
cluster (school), and n is the number of individuals in each subcluster. Note that the design effect

\[ \sqrt{\frac{pn}{2\left[ 1 + (pn\omega_S - 1)\rho_S + (n-1)\rho_C \right]}} \]

can be larger than one, so the operational effect size $\Delta_T$ can be larger than the actual effect size
$\delta$. However, the operational sample size $N_T = m$ is smaller than the actual sample size mpn
assigned to each treatment, so the power in the design with clustering usually will not be
larger than in the design without clustering.

Using the operational effect size makes it possible to compute statistical power and sample
size requirements for analyses based on clustered samples using tables and computer
programs designed for the one-sample t test. For example, entering Table 2 on the row given
by the operational sample size $N_T$, and finding the column corresponding to the operational
effect size $\Delta_T$, one can read the power value.









An alternative method for computing statistical power in clustered designs is to use a
computer program that has a built-in function that computes the noncentral t-distribution. To
use such a function to compute the statistical power of a test, one must provide the program
with a value of an index $\lambda$ (related to the operational effect size) called the noncentrality
parameter, specifically,

\[ \lambda = \sqrt{m}\, \Delta_T = \delta \sqrt{\frac{mpn}{2\left[ 1 + (pn\omega_S - 1)\rho_S + (n-1)\rho_C \right]}} . \tag{27} \]

Power is computed using equation (20) for a one-tailed test or (21) for a two-tailed test, as
above, except that the noncentrality parameter (27) is used with m - 1 degrees of freedom.
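
The noncentral-t route described above can be sketched in code. The following is a minimal illustration using only the Python standard library; it assumes that two-tailed power is the probability that a noncentral t variate with m - 1 degrees of freedom falls beyond the central-t critical values at level $\alpha = 0.05$ (equations (20) and (21) are not reproduced in this appendix, so that form is an assumption). The function names are hypothetical, the tail probability is obtained by simple numerical integration rather than by a library routine, and the sketch is intended for moderate degrees of freedom such as those in this appendix.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF

def nct_upper_tail(c: float, df: int, lam: float, steps: int = 4000) -> float:
    """P(T > c) for T = (Z + lam) / sqrt(V / df), with Z ~ N(0, 1) and
    V ~ chi-square(df), by midpoint integration over the chi-square density."""
    upper = df + 12.0 * math.sqrt(2.0 * df) + 40.0  # mass beyond this is negligible
    h = upper / steps
    log_norm = (df / 2.0) * math.log(2.0) + math.lgamma(df / 2.0)
    total = 0.0
    for i in range(steps):
        v = (i + 0.5) * h
        log_pdf = (df / 2.0 - 1.0) * math.log(v) - v / 2.0 - log_norm
        # Given V = v, the event T > c is Z > c * sqrt(v / df) - lam.
        total += _PHI(lam - c * math.sqrt(v / df)) * math.exp(log_pdf)
    return total * h

def t_critical(df: int, alpha: float = 0.05) -> float:
    """Two-tailed critical value of the central t distribution (bisection)."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if nct_upper_tail(mid, df, 0.0) > alpha / 2.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def power_two_tailed(lam: float, df: int, alpha: float = 0.05) -> float:
    """Probability of rejecting in either tail under noncentrality lam."""
    c = t_critical(df, alpha)
    return nct_upper_tail(c, df, lam) + (1.0 - nct_upper_tail(-c, df, lam))

# Equation (27) with the illustrative values used in this appendix.
delta, m, p, n = 0.35, 30, 2, 10
rho_s, rho_c, omega_s = 0.20, 0.13, 0.5
lam = delta * math.sqrt(
    (m * p * n / 2.0) / (1.0 + (p * n * omega_s - 1.0) * rho_s + (n - 1.0) * rho_c)
)
print(f"lambda = {lam:.3f}, power = {power_two_tailed(lam, m - 1):.3f}")
```

A statistical package with a built-in noncentral t distribution would normally replace `nct_upper_tail`; the integration here only keeps the sketch self-contained.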

Example. Return to the design considered earlier for a study to evaluate the effects of a
second-grade supplemental reading intervention. Remember that the assumptions made here
are intended for illustration only and are not intended to suggest values that will always be
appropriate for other research studies. Continue to assume that a school-level ICC of about
$\rho_S = 0.20$ is plausible and that a classroom-level ICC of about $\rho_C = 0.13$ is plausible. Assume that
n = 10 students from each of p = 2 classrooms would receive each experimental condition
from each school. Previous studies of this intervention suggest that the effect size is likely to
be $\delta = 0.35$ and that effects are likely to be fairly consistent across schools, so that a value of
$\omega_S = 0.5$ is plausible (even fairly conservative). The initial plan was for a study that would
include m = 30 schools. To determine the statistical power of this design, one first computes
the operational effect size, using equation (26):

\[ \Delta_T = (0.35) \sqrt{\frac{2(10)/2}{1 + [2(10)(0.5) - 1](0.20) + (10 - 1)(0.13)}} = (0.35)(1.587) = 0.555 . \]

Entering Table 2 on the row corresponding to $N_T = m = 30$, one sees that the power for
$\Delta_T = 0.56$ will be between 0.75 (the power for $\Delta_T = 0.50$) and 0.89 (the power for $\Delta_T = 0.60$).
Interpolating between these two values (0.56 is six-tenths of the way from 0.5 to 0.6, so go
six-tenths of the way between 0.75 and 0.89), one obtains a power level of 0.83.
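
The arithmetic of this example can be reproduced in a few lines. The sketch below assumes equation (26) as the operational effect size and uses only the two Table 2 entries quoted in the example (0.75 at 0.50 and 0.89 at 0.60 on the row for 30); the function names are hypothetical, and the interpolation is the same linear rule applied in the text.

```python
import math

def operational_effect_size(delta, p, n, rho_s, rho_c, omega_s):
    """Equation (26): delta * sqrt((p*n/2) / (1 + (p*n*omega_s - 1)*rho_s + (n - 1)*rho_c))."""
    denom = 1.0 + (p * n * omega_s - 1.0) * rho_s + (n - 1.0) * rho_c
    return delta * math.sqrt((p * n / 2.0) / denom)

def interpolate_power(effect, lower, upper):
    """Linear interpolation between two tabled (effect size, power) pairs."""
    (x0, y0), (x1, y1) = lower, upper
    return y0 + (effect - x0) / (x1 - x0) * (y1 - y0)

delta_t = operational_effect_size(delta=0.35, p=2, n=10,
                                  rho_s=0.20, rho_c=0.13, omega_s=0.5)
# As in the text, round the operational effect size to two decimals
# before interpolating between the two quoted table entries.
power = interpolate_power(round(delta_t, 2), (0.50, 0.75), (0.60, 0.89))
print(f"operational effect size = {delta_t:.3f}, interpolated power = {power:.2f}")
```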



Three-Level Randomized-Block Designs That Assign Treatment to 
Subclusters With Covariates 



Now suppose that the analysis has $q_S$ ($0 \le q_S < m - 1$) cluster-level covariates, $q_C$
($0 \le q_C < mp - q_S - 1$) subcluster-level covariates, and $q_W$ ($0 \le q_W < N - q_S - q_C - 2$)
individual-level covariates.4 For example, a design with $q_W = 1$, $q_C = 1$, and $q_S = 1$ might arise if a
pretest were used (centered on subcluster means) as an individual-level covariate, subcluster means
(centered on cluster means) on the pretest were used as a subcluster-level covariate, and
cluster means on the pretest were used as a cluster-level covariate.



4 Note that the possibility of having 0 (no) covariates at a given level has been included in the previous section, 
and that adding covariates necessitates modifications of the operational sample size and the operational effect 
sizes. 

When covariates are used in the design, both the operational effect size and the operational
sample size are slightly modified. As was the case in the two-level randomized-block design,
one would enter Table 2 with operational sample size $N_T^A = m - q_S$.

The operational effect size is increased to an extent that depends on how much the covariates
explain between-cluster variance in treatment effects, variance between subcluster means, and
variance within subclusters. $R_W^2$ is the amount of within-subcluster variance the covariates
explain, $R_{TS}^2$ is the amount of between-cluster (school) variance in treatment effects the
covariates explain, and $R_C^2$ is the amount of between-subcluster (classroom) variance the
covariates explain. One can think of $R_W^2$, $R_C^2$, and $R_{TS}^2$ as proportions of variance accounted
for (squared correlations) in the usual way. The covariate-adjusted operational effect size is

\[ \Delta_T^A = \delta \sqrt{\frac{m}{m - q_S}} \sqrt{\frac{pn/2}{1 + (pn\omega_S - 1)\rho_S + (n-1)\rho_C - \left[ R_W^2 + (pn\omega_S R_{TS}^2 - R_W^2)\rho_S + (nR_C^2 - R_W^2)\rho_C \right]}} . \tag{28} \]

The covariate-adjusted operational effect size is generally larger than the unadjusted operational
effect size $\Delta_T$ (and cannot be smaller than $\Delta_T$) because $R_{TS}^2$, $R_C^2$, and $R_W^2$ are generally larger than
zero. Hence, the term in square brackets in the denominator of the second term of the design
effect

\[ \sqrt{\frac{m}{m - q_S}} \sqrt{\frac{pn/2}{1 + (pn\omega_S - 1)\rho_S + (n-1)\rho_C - \left[ R_W^2 + (pn\omega_S R_{TS}^2 - R_W^2)\rho_S + (nR_C^2 - R_W^2)\rho_C \right]}} \]

is generally positive. In computing $\Delta_T^A$, it may be more convenient to break the denominator
of the second term of the design effect into two parts. One part,

\[ A = 1 + (pn\omega_S - 1)\rho_S + (n-1)\rho_C , \]

reflects the part due to clustering, and the second part,

\[ B = R_W^2 + (pn\omega_S R_{TS}^2 - R_W^2)\rho_S + (nR_C^2 - R_W^2)\rho_C , \]

reflects the adjustment for the effects of covariates, so that

\[ \Delta_T^A = \delta \sqrt{\frac{m}{m - q_S}} \sqrt{\frac{pn/2}{A - B}} . \tag{29} \]

Using the operational effect size makes it possible to compute statistical power and sample-size
requirements for analyses based on clustered samples using these tables and computer
programs designed for the one-sample t test. For example, entering Table 2 on the row given
by the operational sample size $N_T^A$, and finding the column corresponding to the operational
effect size $\Delta_T^A$, one can read the power value.

An alternative method for computing statistical power in clustered designs is to use a
computer program that has a built-in function that computes the noncentral t-distribution. To
use such a function to compute the statistical power of a test, one must provide the program
with a value of an index $\lambda^A$ (related to the operational effect size) called the noncentrality
parameter, specifically,

\[ \lambda^A = \sqrt{m - q_S}\, \Delta_T^A = \delta \sqrt{\frac{mpn/2}{1 + (pn\omega_S - 1)\rho_S + (n-1)\rho_C - \left[ R_W^2 + (pn\omega_S R_{TS}^2 - R_W^2)\rho_S + (nR_C^2 - R_W^2)\rho_C \right]}} . \tag{30} \]

Power is computed using equation (20) for a one-tailed test or (21) for a two-tailed test, as
above, except that the noncentrality parameter (30) is used with $m - q_S - 1$ degrees of
freedom.

Example. Return to the design considered earlier for a study to evaluate the effects of a
second-grade supplemental reading intervention. Continue to assume that a school-level ICC
of about $\rho_S = 0.20$ is plausible and that a classroom-level ICC of about $\rho_C = 0.13$ is plausible.
Assume that n = 10 students from each of p = 2 classrooms would receive each experimental
condition from each school. Previous studies of this intervention suggest that the effect size is
likely to be $\delta = 0.35$ and that effects are likely to be fairly consistent across schools, so that a
value of $\omega_S = 0.5$ is plausible (even fairly conservative). Suppose that pretreatment reading test
scores are available and that classroom-centered individual pretest scores, school-centered
classroom mean pretest scores, and school mean pretest scores will be used as covariates.

Thus, $q_W = q_C = q_S = 1$. There is some evidence that values of $R_W^2 = 0.5$, $R_C^2 = 0.6$, and $R_S^2 = 0.8$
are plausible. We again assume that about half of the total variance between schools
explained by the covariates is due to the ability of the covariates to predict treatment effect
variability, so that $R_{TS}^2 = 0.4$. We clarify that these assumptions are intended for illustration
only and are not intended to suggest values that will always be appropriate for other research
studies. The initial plan was for a study that would include m = 20 schools.

To determine the statistical power of this design, one first computes the operational effect size
via equation (29), using

\[ A = 1 + [(2)(10)(0.5) - 1](0.2) + (10 - 1)(0.13) = 3.970 \]

and

\[ B = 0.5 + [(2)(10)(0.5)(0.4) - 0.5](0.2) + [(10)(0.6) - 0.5](0.13) = 1.915 \]

to obtain

\[ \Delta_T^A = (0.35) \sqrt{\frac{20}{19}} \sqrt{\frac{2(10)/2}{3.970 - 1.915}} = (0.35)(1.026)(2.206) = 0.79 . \]

Entering Table 2 on the row corresponding to $N_T^A = m - q_S = 19$, one sees that the power for
$\Delta_T^A = 0.79$ is slightly less than 0.91 (the power value listed for $\Delta_T = 0.80$).

One might regard this high statistical power as providing a margin of safety in case any
assumptions are somewhat optimistic. Alternatively, because power of 0.91 may be higher
than necessary, one might consider altering the design to decrease costs while maintaining
acceptable statistical power. For example, one might consider decreasing the number of
schools to m = 15. That decrease would have little effect on the operational effect size (it
would now be 0.80), but reading power values from the row of the table for $N_T^A = m - q_S = 14$,
one sees that the power is 0.79.
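
The two-part computation in this example (the clustering part A, the covariate part B, and their combination in equation (29)) can be sketched as follows. The function and argument names are hypothetical; the numerical inputs are the illustrative values from the example, and the sketch compares the planned m = 20 schools with the reduced m = 15.

```python
import math

def adjusted_operational_effect_size(delta, m, q_s, p, n, rho_s, rho_c,
                                     omega_s, r2_w, r2_c, r2_ts):
    """Return (A, B, adjusted operational effect size) for a three-level
    randomized-block design that assigns treatments to subclusters."""
    # A: the part of the denominator due to clustering.
    a = 1.0 + (p * n * omega_s - 1.0) * rho_s + (n - 1.0) * rho_c
    # B: the adjustment for the variance explained by covariates.
    b = r2_w + (p * n * omega_s * r2_ts - r2_w) * rho_s + (n * r2_c - r2_w) * rho_c
    effect = delta * math.sqrt(m / (m - q_s)) * math.sqrt((p * n / 2.0) / (a - b))
    return a, b, effect

common = dict(delta=0.35, q_s=1, p=2, n=10, rho_s=0.20, rho_c=0.13,
              omega_s=0.5, r2_w=0.5, r2_c=0.6, r2_ts=0.4)

for m in (20, 15):
    a, b, effect = adjusted_operational_effect_size(m=m, **common)
    # The operational sample size used to enter the power table is m - q_s.
    print(f"m={m}: A={a:.3f}  B={b:.3f}  effect={effect:.2f}  table row={m - common['q_s']}")
```

Because m enters the effect size only through the factor $\sqrt{m/(m - q_S)}$, reducing the number of schools barely changes the operational effect size; the loss of power comes almost entirely from the smaller operational sample size.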

Three-Level Randomized-Block Designs That Assign Treatment Within 
Subclusters With No Covariates 

Suppose that one is planning a three-level randomized-block experiment that will use a total
of m clusters (typically schools), each with p subclusters (classrooms) and 2n individuals
within each of these subclusters. The design will assign n individuals within each of these
subclusters (classrooms) to a treatment condition and n individuals within each of these
subclusters (classrooms) to a control condition. Thus, mpn individuals are assigned to each
treatment, and the total sample size is N = 2mpn.

If the actual number of clusters in the experiment is m, one enters the power table with
operational sample size $N_T = m$. The operational effect size is

\[ \Delta_T = \delta \sqrt{\frac{pn}{2\left[ 1 + (pn\omega_S - 1)\rho_S + (n\omega_C - 1)\rho_C \right]}} , \tag{31} \]

where $\delta$ is the effect size, $\rho_S$ is the school-level ICC, $\rho_C$ is the classroom-level ICC, $\omega_S$ is one
half of the ratio of variance of the treatment effects across clusters to the total cluster-level
variance, $\omega_C$ is one half of the ratio of variance of the treatment effects across subclusters to
the total subcluster-level variance, p is the number of subclusters (classrooms) within each
cluster (school), and n is the number of individuals assigned to each treatment within each
subcluster. Note that the design effect

\[ \sqrt{\frac{pn}{2\left[ 1 + (pn\omega_S - 1)\rho_S + (n\omega_C - 1)\rho_C \right]}} \]

can be larger than one, so the operational effect size $\Delta_T$ can be larger than the actual effect size
$\delta$. However, the operational sample size $N_T = m$ is smaller than the actual sample size mpn
assigned to each treatment, so the power in the design with clustering usually will not be
larger than in the design without clustering.

Using the operational effect size makes it possible to compute statistical power and sample
size requirements for analyses based on clustered samples using tables and computer
programs designed for the one-sample t test. For example, entering Table 2 on the row given
by the operational sample size $N_T$, and finding the column corresponding to the operational
effect size $\Delta_T$, one can read the power value.



An alternative method for computing statistical power in clustered designs is to use a
computer program that has a built-in function that computes the noncentral t-distribution. To
use such a function to compute the statistical power of a test, one must provide the program
with a value of an index $\lambda$ (related to the operational effect size) called the noncentrality
parameter, specifically,

\[ \lambda = \sqrt{m}\, \Delta_T = \delta \sqrt{\frac{mpn}{2\left[ 1 + (pn\omega_S - 1)\rho_S + (n\omega_C - 1)\rho_C \right]}} . \tag{32} \]

Power is computed using equation (20) for a one-tailed test or (21) for a two-tailed test, as
above.

Example. Return to the design considered earlier for a study to evaluate the effects of a
second-grade supplemental reading intervention. Continue to assume that a school-level ICC
of about $\rho_S = 0.20$ is plausible and that a classroom-level ICC of about $\rho_C = 0.13$ is plausible.
Assume that n = 10 students would be assigned to each treatment within each of p = 2
classrooms in each school. Previous studies of this intervention
suggest that the effect size is likely to be $\delta = 0.35$ and that effects are likely to be fairly
consistent across schools, so that a value of $\omega_S = \omega_C = 0.5$ is plausible (even fairly
conservative). The initial plan was for a study that would include m = 30 schools. We again
clarify that these assumptions are intended for illustration only and are not intended to suggest
values that will always be appropriate for other research studies. To determine the statistical
power of this design, one first computes the operational effect size, using equation (31):

\[ \Delta_T = (0.35) \sqrt{\frac{2(10)}{2\left\{ 1 + [2(10)(0.5) - 1](0.20) + [(10)(0.5) - 1](0.13) \right\}}} = (0.35)(1.736) = 0.607 . \]

Entering Table 2 on the row corresponding to $N_T = m = 30$, one sees that the power for
$\Delta_T = 0.61$ will be between 0.89 (the power for $\Delta_T = 0.60$) and 0.96 (the power for $\Delta_T = 0.70$).
Interpolating between these two values (0.61 is one-tenth of the way from 0.6 to 0.7, so go
one-tenth of the way between 0.89 and 0.96), one obtains a power level of 0.90.
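
Equations (26) and (31) differ only in the classroom-level term of the denominator: (n - 1) when treatments are assigned to whole subclusters and ($n\omega_C$ - 1) when they are assigned to individuals within subclusters. The sketch below, with a hypothetical function name and the illustrative values from the two no-covariate examples, puts the two design effects side by side. In both designs mpn individuals receive each treatment, although n counts all individuals in a subcluster in the first design and individuals per treatment in a subcluster in the second.

```python
import math

def design_effect_rb3(p, n, rho_s, rho_c, omega_s, omega_c=None):
    """Design effect for a three-level randomized-block design.

    omega_c=None  -> treatments assigned to subclusters (equation 26)
    omega_c=value -> treatments assigned to individuals (equation 31)
    """
    classroom_term = (n - 1.0) if omega_c is None else (n * omega_c - 1.0)
    denom = 1.0 + (p * n * omega_s - 1.0) * rho_s + classroom_term * rho_c
    return math.sqrt((p * n / 2.0) / denom)

to_subclusters = design_effect_rb3(p=2, n=10, rho_s=0.20, rho_c=0.13, omega_s=0.5)
to_individuals = design_effect_rb3(p=2, n=10, rho_s=0.20, rho_c=0.13,
                                   omega_s=0.5, omega_c=0.5)
print(f"assigned to subclusters: {to_subclusters:.3f}")
print(f"assigned to individuals: {to_individuals:.3f}")
```

When $\omega_C = 1$ the two expressions coincide, so any advantage of assigning individuals within classrooms depends on treatment effects being reasonably homogeneous across classrooms.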

Three-Level Randomized-Block Designs That Assign Treatment Within
Subclusters With Covariates

Now suppose that the analysis has $q_S$ ($0 \le q_S < m - 1$) cluster-level covariates, $q_C$
($0 \le q_C < mp - q_S - 1$) subcluster-level covariates, and $q_W$ ($0 \le q_W < N - q_S - q_C - 2$)
individual-level covariates.5 For example, a design with $q_W = 1$, $q_C = 1$, and $q_S = 1$ might arise if a
pretest were used (centered on subcluster means) as an individual-level covariate, subcluster means
(centered on cluster means) on the pretest were used as a subcluster-level covariate, and cluster
means on the pretest were used as a cluster-level covariate.

When covariates are used in the design, the operational effect size is modified (typically
increased) to an extent that depends on how much the covariates explain between-cluster and
between-subcluster variance in treatment effects and within-cluster variance. $R_W^2$ is the
amount of within-cluster variance the covariates explain, $R_{TS}^2$ is the amount of between-cluster
(between-schools) variance in treatment effects the covariates explain, and $R_{TC}^2$ is the
amount of between-subcluster but within-cluster (between-classroom within-schools) variance
in treatment effects the covariates explain. One can think of $R_W^2$, $R_{TC}^2$, and $R_{TS}^2$ as proportions
of variance accounted for (squared correlations) in the usual way. The covariate-adjusted
operational effect size is

\[ \Delta_T^A = \delta \sqrt{\frac{m}{m - q_S}} \sqrt{\frac{pn/2}{1 + (pn\omega_S - 1)\rho_S + (n\omega_C - 1)\rho_C - \left[ R_W^2 + (pn\omega_S R_{TS}^2 - R_W^2)\rho_S + (n\omega_C R_{TC}^2 - R_W^2)\rho_C \right]}} . \tag{33} \]

The covariate-adjusted operational effect size is generally larger than the unadjusted operational
effect size $\Delta_T$ (and cannot be smaller than $\Delta_T$) because $R_{TS}^2$, $R_{TC}^2$, and $R_W^2$ are generally larger than
zero. Hence, the term in square brackets in the denominator of the second term of the design
effect

\[ \sqrt{\frac{m}{m - q_S}} \sqrt{\frac{pn/2}{1 + (pn\omega_S - 1)\rho_S + (n\omega_C - 1)\rho_C - \left[ R_W^2 + (pn\omega_S R_{TS}^2 - R_W^2)\rho_S + (n\omega_C R_{TC}^2 - R_W^2)\rho_C \right]}} \]

is generally positive.

In computing $\Delta_T^A$, it may be more convenient to break the denominator of the design effect
into two parts. One part,

\[ A = 1 + (pn\omega_S - 1)\rho_S + (n\omega_C - 1)\rho_C , \]

reflects the impact of clustering, and the second part,

\[ B = R_W^2 + (pn\omega_S R_{TS}^2 - R_W^2)\rho_S + (n\omega_C R_{TC}^2 - R_W^2)\rho_C , \]

reflects the adjustment for the effects of covariates, so that

\[ \Delta_T^A = \delta \sqrt{\frac{m}{m - q_S}} \sqrt{\frac{pn/2}{A - B}} . \tag{34} \]

Using the operational effect size makes it possible to compute statistical power and sample-size
requirements for analyses based on clustered samples using power tables designed for the
one-sample t test. For example, entering Table 2 on the row given by the operational sample
size $N_T^A$, and finding the column corresponding to the operational effect size $\Delta_T^A$, one can read
the power value.

5 Note that the possibility of having 0 (no) covariates at a given level has been included in the previous section,
and that adding covariates necessitates modifications of the operational sample size and the operational effect
sizes.



An alternative method for computing statistical power in clustered designs is to use a
computer program that has a built-in function that computes the noncentral t-distribution. To
use such a function to compute the statistical power of a test, one must provide the program
with a value of an index $\lambda^A$ (related to the operational effect size) called the noncentrality
parameter, specifically,

\[ \lambda^A = \sqrt{m - q_S}\, \Delta_T^A = \delta \sqrt{\frac{mpn/2}{1 + (pn\omega_S - 1)\rho_S + (n\omega_C - 1)\rho_C - \left[ R_W^2 + (pn\omega_S R_{TS}^2 - R_W^2)\rho_S + (n\omega_C R_{TC}^2 - R_W^2)\rho_C \right]}} . \tag{35} \]

Power is computed using equation (20) for a one-tailed test or (21) for a two-tailed test, as
above, except that the covariate-adjusted noncentrality parameter (35) is used with $m - 1 - q_S$
degrees of freedom.

Example. Returning to the design considered earlier for a study to evaluate the effects of a
second-grade supplemental reading intervention, continue to assume that a school-level ICC
of about $\rho_S = 0.20$ is plausible and that a classroom-level ICC of about $\rho_C = 0.13$ is plausible.
Assume that n = 10 students would be assigned to each treatment from each of p = 2
classrooms in each school. Previous studies of this intervention suggest that the effect size is
likely to be $\delta = 0.35$ and that effects are likely to be fairly consistent across schools so that a
value of $\omega_S = \omega_C = 0.5$ is plausible (even fairly conservative). Suppose that pretreatment
reading test scores are available and that classroom-centered individual pretest scores,
school-centered classroom mean pretest scores, and school mean pretest scores will be used as
covariates. Thus, $q_W = q_C = q_S = 1$. There is some evidence that values of $R_W^2 = 0.5$, $R_C^2 = 0.6$,
and $R_S^2 = 0.8$ are plausible. We again assume that about half of the total variance between
schools explained by the covariates is due to the ability of the covariates to predict treatment
effect variability, so that $R_{TS}^2 = 0.4$. We also assume that about half of the total variance
between classrooms within schools explained by the covariates is due to the ability of the
covariates to predict treatment effect variability, so that $R_{TC}^2 = 0.3$. However, we clarify that
these assumptions are intended for illustration only and are not intended to suggest values that
will always be appropriate for other research studies. The initial plan was for a study that
would include m = 15 schools.

To determine the statistical power of this design, one first computes the operational effect size
via equation (34), using

\[ A = 1 + [(2)(10)(0.5) - 1](0.2) + [(10)(0.5) - 1](0.13) = 3.320 \]

and

\[ B = 0.5 + [(2)(10)(0.5)(0.4) - 0.5](0.2) + [(10)(0.5)(0.3) - 0.5](0.13) = 1.33 \]

to obtain

\[ \Delta_T^A = (0.35) \sqrt{\frac{15}{14}} \sqrt{\frac{2(10)/2}{3.320 - 1.33}} = (0.35)(1.035)(2.24) = 0.812 . \]

Entering Table 2 on the row corresponding to $N_T^A = m - q_S = 14$, one sees that the power for
$\Delta_T^A = 0.81$ is slightly more than that listed for $\Delta_T = 0.80$, which is 0.79.
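
Because the heterogeneity ratios are planning assumptions rather than known quantities, it can be useful to see how strongly the result of equation (34) depends on them. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the base case reproduces the example above, and the alternative values of $\omega_S = \omega_C$ are arbitrary choices made only to show the direction of the effect.

```python
import math

def adjusted_effect_size_within(delta, m, q_s, p, n, rho_s, rho_c,
                                omega_s, omega_c, r2_w, r2_ts, r2_tc):
    """Equation (34): covariate-adjusted operational effect size when
    treatments are assigned to individuals within subclusters."""
    a = 1.0 + (p * n * omega_s - 1.0) * rho_s + (n * omega_c - 1.0) * rho_c
    b = (r2_w + (p * n * omega_s * r2_ts - r2_w) * rho_s
         + (n * omega_c * r2_tc - r2_w) * rho_c)
    return delta * math.sqrt(m / (m - q_s)) * math.sqrt((p * n / 2.0) / (a - b))

base = dict(delta=0.35, m=15, q_s=1, p=2, n=10, rho_s=0.20, rho_c=0.13,
            r2_w=0.5, r2_ts=0.4, r2_tc=0.3)

# Vary the assumed heterogeneity ratios; 0.5 is the value used in the example.
for omega in (0.25, 0.5, 1.0):
    effect = adjusted_effect_size_within(omega_s=omega, omega_c=omega, **base)
    print(f"omega_S = omega_C = {omega:4.2f}  ->  operational effect size = {effect:.3f}")
```

Larger heterogeneity ratios inflate the denominator A - B and therefore shrink the operational effect size, which is why a value of 0.5 is described in the example as fairly conservative.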

Appendix D: Multilevel Models Defining Tests for Treatment Effects

This appendix describes the multilevel models on which the power computations are based.
When designs are balanced, the power calculations are exact because exact tests for the
treatment effect based on the analysis of variance are possible.

Two-Level Hierarchical Designs With No Covariates

Suppose that m clusters of size n are assigned to each treatment. Let $Y_{ij}$ be the jth observation
in the ith cluster. Then the level 1 (individual-level) model is

\[ Y_{ij} = \beta_{0i} + \varepsilon_{ij}, \quad i = 1, \ldots, 2m; \; j = 1, \ldots, n, \]

where $\beta_{0i}$ is the mean of the ith cluster and the $\varepsilon_{ij}$ are independently normally distributed with
mean 0 and variance $\sigma_W^2$. The level 2 (cluster-level) model is

\[ \beta_{0i} = \gamma_{00} + \gamma_{01} T_i + \eta_{0i}, \quad i = 1, \ldots, 2m, \]

where $\gamma_{00}$ is the grand mean, $\gamma_{01}$ is the treatment effect, $T_i$ is a treatment indicator coded $T_i = 1/2$
for treatment clusters and $T_i = -1/2$ for the control clusters, and the $\eta_{0i}$ are independently
normally distributed with mean 0 and variance $\sigma_S^2$.

The treatment effect size $\delta$ is defined as

\[ \delta = \frac{\gamma_{01}}{\sqrt{\sigma_S^2 + \sigma_W^2}} . \]

The intraclass correlation coefficient (ICC) is defined in terms of the variances as

\[ \rho = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_W^2} . \]

The test for the treatment effect in this design is a test of the hypothesis that $\gamma_{01} = 0$.
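
To make the two-level model concrete, the following sketch draws data from it and evaluates the effect size and ICC implied by a set of variance components. It is an illustration only: the function names, the seed, and the numerical values (chosen so that the total variance is 100, the effect size is 0.35, and the ICC is 0.20) are assumptions and do not come from the paper.

```python
import math
import random
import statistics

def simulate_two_level(m, n, gamma00, gamma01, sigma_s, sigma_w, seed=0):
    """Draw outcomes from the two-level model with 2m clusters of size n.
    Returns a list of (T_i, [Y_i1, ..., Y_in]) with T_i coded +1/2 or -1/2."""
    rng = random.Random(seed)
    data = []
    for i in range(2 * m):
        t = 0.5 if i < m else -0.5  # first m clusters treated, last m control
        # Level 2: cluster mean = grand mean + treatment effect * T_i + cluster effect.
        beta0 = gamma00 + gamma01 * t + rng.gauss(0.0, sigma_s)
        # Level 1: individual outcome = cluster mean + individual residual.
        data.append((t, [beta0 + rng.gauss(0.0, sigma_w) for _ in range(n)]))
    return data

def effect_size(gamma01, sigma_s, sigma_w):
    """delta = gamma01 / sqrt(sigma_S^2 + sigma_W^2)."""
    return gamma01 / math.sqrt(sigma_s ** 2 + sigma_w ** 2)

def icc(sigma_s, sigma_w):
    """rho = sigma_S^2 / (sigma_S^2 + sigma_W^2)."""
    return sigma_s ** 2 / (sigma_s ** 2 + sigma_w ** 2)

sigma_s, sigma_w = math.sqrt(20.0), math.sqrt(80.0)  # variances 20 and 80 (assumed)
data = simulate_two_level(m=15, n=10, gamma00=50.0, gamma01=3.5,
                          sigma_s=sigma_s, sigma_w=sigma_w)

treated = statistics.mean(statistics.mean(y) for t, y in data if t > 0)
control = statistics.mean(statistics.mean(y) for t, y in data if t < 0)
print(f"delta = {effect_size(3.5, sigma_s, sigma_w):.2f}, rho = {icc(sigma_s, sigma_w):.2f}")
print(f"observed difference in cluster means = {treated - control:.2f}")
```

Coding $T_i$ as plus or minus one half makes $\gamma_{01}$ equal to the difference between the treatment and control means, so the observed difference printed at the end is a direct (noisy) estimate of the treatment effect.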

Two-Level Hierarchical Designs With Covariates

Suppose that m clusters of size n are assigned to each treatment. Let $Y_{ij}$ be the jth observation
in the ith cluster. Now suppose that q covariates are modeled at the individual level and r
covariates are modeled at the cluster level. Thus, the level 1 (individual-level) model is

\[ Y_{ij} = \beta_{0i} + \beta_1 X_{1ij} + \cdots + \beta_q X_{qij} + \varepsilon_{ij}, \quad i = 1, \ldots, 2m; \; j = 1, \ldots, n, \]

where $\beta_{0i}$ is the covariate-adjusted mean of the ith cluster, $\beta_1, \ldots, \beta_q$ are the (fixed) effects of
the individual-level covariates, $X_{1ij}, \ldots, X_{qij}$ are the values of the individual-level covariates
(centered on cluster means), and the $\varepsilon_{ij}$ are independently normally distributed with mean 0
and variance $\sigma_{AW}^2$. The level 2 (cluster-level) model is

\[ \beta_{0i} = \gamma_{00} + \gamma_{01} T_i + \gamma_{02} W_{1i} + \cdots + \gamma_{0(r+1)} W_{ri} + \eta_{0i}, \quad i = 1, \ldots, 2m, \]

where $\gamma_{00}$ is the covariate-adjusted grand mean, $\gamma_{01}$ is the treatment effect, $\gamma_{02}, \ldots, \gamma_{0(r+1)}$ are
the effects of cluster-level covariates, $W_{1i}, \ldots, W_{ri}$ are the values of the cluster-level covariates
for cluster i, $T_i$ is a treatment indicator coded $T_i = 1/2$ for treatment clusters and $T_i = -1/2$ for the
control clusters, and the $\eta_{0i}$ are independently normally distributed with mean 0 and variance
$\sigma_{AS}^2$. Note that covariates are treated as having fixed effects.

In this model, the treatment effect size $\delta$ is still defined in terms of the unadjusted total
standard deviation, that is,

\[ \delta = \frac{\gamma_{01}}{\sqrt{\sigma_S^2 + \sigma_W^2}} . \]

The ICC is also defined in terms of the unadjusted variances as

\[ \rho = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_W^2} , \]

and the covariate outcome correlations are defined in terms of the adjusted and unadjusted
variances as

\[ R_S^2 = 1 - \frac{\sigma_{AS}^2}{\sigma_S^2} \]

and

\[ R_W^2 = 1 - \frac{\sigma_{AW}^2}{\sigma_W^2} . \]

The test for the treatment effect in this design is a test of the hypothesis that $\gamma_{01} = 0$.

Three-Level Hierarchical Designs With No Covariates 

Suppose that m clusters, each with p subclusters of size n, are assigned to each treatment. Let 
$Y_{ijk}$ be the $k$th observation in the $j$th subcluster of the $i$th cluster. Thus, the level 1 (individual-level)
model is

$$Y_{ijk} = \beta_{0ij} + \varepsilon_{ijk}, \quad i = 1, \ldots, 2m; \; j = 1, \ldots, p; \; k = 1, \ldots, n,$$

where $\beta_{0ij}$ is the mean of the $j$th subcluster in the $i$th cluster, and the $\varepsilon_{ijk}$ are independently
normally distributed with mean 0 and variance $\sigma_W^2$. The level 2 (subcluster-level) model is

$$\beta_{0ij} = \gamma_{00i} + \eta_{0ij}, \quad j = 1, \ldots, p; \; i = 1, \ldots, 2m,$$

where $\gamma_{00i}$ is the mean of the $i$th cluster and the $\eta_{0ij}$ are independently normally distributed with
mean 0 and variance $\sigma_C^2$. The level 3 (cluster-level) model is

$$\gamma_{00i} = \pi_{00} + \pi_{01} T_i + \xi_{0i}, \quad i = 1, \ldots, 2m,$$

where $\pi_{00}$ is the grand mean, $\pi_{01}$ is the treatment effect, $T_i$ is a treatment indicator coded
$T_i = 1/2$ for treatment clusters and $T_i = -1/2$ for the control clusters, and the $\xi_{0i}$ are independently
normally distributed with mean 0 and variance $\sigma_S^2$.

The treatment effect size $\delta$ is defined as

$$\delta = \frac{\pi_{01}}{\sqrt{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}}.$$

The ICCs are defined in terms of the variances as

$$\rho_S = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}$$

and

$$\rho_C = \frac{\sigma_C^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}.$$

The test for the treatment effect in this design is a test of the hypothesis that $\pi_{01} = 0$.
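
To make the three-level structure concrete, the sketch below draws one data set from a model of this form using only the Python standard library. It is a minimal illustration under stated assumptions: the function names, the argument names, and the choice to treat the first m clusters as the treatment arm (rather than randomizing) are all introduced here and are not part of the paper.

```python
import random

def simulate_three_level(m, p, n, pi00, pi01, sd_s, sd_c, sd_w, seed=0):
    """Draw one data set from a three-level hierarchical model.

    2m clusters (m per arm), p subclusters per cluster, n individuals per
    subcluster. Returns a list of (T_i, y) pairs, one per individual.
    """
    rng = random.Random(seed)
    data = []
    for i in range(2 * m):
        t = 0.5 if i < m else -0.5  # treatment indicator coded +1/2 or -1/2
        cluster_mean = pi00 + pi01 * t + rng.gauss(0.0, sd_s)          # level 3
        for _ in range(p):
            subcluster_mean = cluster_mean + rng.gauss(0.0, sd_c)      # level 2
            for _ in range(n):
                data.append((t, subcluster_mean + rng.gauss(0.0, sd_w)))  # level 1
    return data

def mean_difference(data):
    """Treatment-group mean minus control-group mean."""
    treated = [y for t, y in data if t > 0]
    control = [y for t, y in data if t < 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Hypothetical parameter values, chosen only for illustration.
data = simulate_three_level(m=10, p=4, n=20, pi00=0.0, pi01=0.25,
                            sd_s=0.3, sd_c=0.3, sd_w=0.9)
estimate = mean_difference(data)
```

With the indicator coded +1/2 and -1/2, the treatment and control means differ by exactly the treatment-effect coefficient, which is why the simple difference in group means estimates it.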

Three-Level Hierarchical Designs With Covariates 

Suppose that m clusters, each with p subclusters of size n, are assigned to each treatment. 

Now suppose that q > 0 covariates are at the cluster level, r > 0 covariates are at the subcluster 
level, and s > 0 covariates are at the individual level. Let $Y_{ijk}$ be the $k$th observation in the $j$th
subcluster of the $i$th cluster. Thus, the level 1 (individual-level) model is

$$Y_{ijk} = \beta_{0ij} + \beta_1 X_{1ijk} + \cdots + \beta_q X_{qijk} + \varepsilon_{ijk}, \quad i = 1, \ldots, 2m; \; j = 1, \ldots, p; \; k = 1, \ldots, n,$$

where $\beta_{0ij}$ is the covariate-adjusted mean of the $j$th subcluster in the $i$th cluster, $\beta_1, \ldots, \beta_q$ are the
effects of the individual-level covariates (which are fixed effects), $X_{1ijk}, \ldots, X_{qijk}$ are the values
of the individual-level covariates (centered on subcluster means), and the $\varepsilon_{ijk}$ are
independently normally distributed with mean 0 and variance $\sigma_{AW}^2$. The level 2 (subcluster-level)
model is

$$\beta_{0ij} = \gamma_{00i} + \gamma_1 Z_{1ij} + \cdots + \gamma_r Z_{rij} + \eta_{0ij}, \quad j = 1, \ldots, p; \; i = 1, \ldots, 2m,$$

where $\gamma_{00i}$ is the covariate-adjusted mean of the $i$th cluster, $\gamma_1, \ldots, \gamma_r$ are the effects of the level
2 covariates (which are fixed effects), $Z_{1ij}, \ldots, Z_{rij}$ are the values of the subcluster-level
covariates (centered on cluster means), and the $\eta_{0ij}$ are independently normally distributed
with mean 0 and variance $\sigma_{AC}^2$. The level 3 (cluster-level) model is

$$\gamma_{00i} = \pi_{00} + \pi_{01} T_i + \pi_{02} W_{1i} + \cdots + \pi_{0(s+1)} W_{si} + \xi_{0i}, \quad i = 1, \ldots, 2m,$$

where $\pi_{00}$ is the covariate-adjusted grand mean, $\pi_{01}$ is the treatment effect, $\pi_{02}, \ldots, \pi_{0(s+1)}$ are
the effects of the level 3 covariates, $T_i$ is a treatment indicator coded $T_i = 1/2$ for treatment
clusters and $T_i = -1/2$ for the control clusters, $W_{1i}, \ldots, W_{si}$ are the values of the cluster-level
covariates, and the $\xi_{0i}$ are independently normally distributed with mean 0 and variance $\sigma_{AS}^2$.

The treatment effect size $\delta$ is defined in terms of the unadjusted variances; that is,

$$\delta = \frac{\pi_{01}}{\sqrt{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}}.$$

The ICCs are defined in terms of the unadjusted variances as

$$\rho_S = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}$$

and

$$\rho_C = \frac{\sigma_C^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}.$$

The covariate-outcome correlations are defined in terms of the adjusted and unadjusted
variances as

$$R_S^2 = 1 - \frac{\sigma_{AS}^2}{\sigma_S^2}, \qquad R_C^2 = 1 - \frac{\sigma_{AC}^2}{\sigma_C^2},$$

and

$$R_W^2 = 1 - \frac{\sigma_{AW}^2}{\sigma_W^2}.$$

The test for the treatment effect in this design is a test of the hypothesis that $\pi_{01} = 0$.
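
In planning work, the ICCs and covariate-outcome correlations are often the quantities one has guesses for, and the adjusted variances follow from them. The sketch below inverts the definitions, assuming each R-squared is one minus the ratio of adjusted to unadjusted variance at its level and scaling the total unadjusted variance to 1. The function name and arguments are hypothetical.

```python
def adjusted_components(rho_s, rho_c, r2_s, r2_c, r2_w):
    """Covariate-adjusted variance at each level, total unadjusted variance = 1.

    rho_s, rho_c     : cluster-level and subcluster-level ICCs
    r2_s, r2_c, r2_w : covariate-outcome R^2 at cluster, subcluster, individual level
    """
    if not 0.0 <= rho_s + rho_c < 1.0:
        raise ValueError("ICCs must be non-negative and sum to less than 1")
    var_s, var_c = rho_s, rho_c          # unadjusted variances on the unit scale
    var_w = 1.0 - rho_s - rho_c
    return {
        "var_as": (1.0 - r2_s) * var_s,  # adjusted cluster-level variance
        "var_ac": (1.0 - r2_c) * var_c,  # adjusted subcluster-level variance
        "var_aw": (1.0 - r2_w) * var_w,  # adjusted individual-level variance
    }

# Hypothetical planning values, chosen only for illustration.
components = adjusted_components(rho_s=0.2, rho_c=0.1, r2_s=0.5, r2_c=0.4, r2_w=0.3)
```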

Two-Level Randomized-Block Designs With No Covariates 

Suppose that there are m clusters of size 2 n and that n individuals in each cluster are assigned 
to each treatment. Let $Y_{ij}$ be the $j$th observation in the $i$th cluster. Thus, the level 1 (individual-level)
model is

$$Y_{ij} = \beta_{0i} + \beta_{1i} T_{ij} + \varepsilon_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, 2n,$$

where $\beta_{0i}$ is the mean of the $i$th cluster, $T_{ij}$ is a treatment indicator coded $T_{ij} = 1/2$ for individuals
assigned to treatment and $T_{ij} = -1/2$ for the individuals assigned to control, and the $\varepsilon_{ij}$ are
independently normally distributed with mean 0 and variance $\sigma_W^2$. The level 2 (cluster-level)
model is

$$\beta_{0i} = \gamma_{00} + \eta_{0i}, \quad i = 1, \ldots, m$$

and

$$\beta_{1i} = \gamma_{10} + \eta_{1i}, \quad i = 1, \ldots, m,$$

where $\gamma_{00}$ is the grand mean, $\gamma_{10}$ is the (mean) treatment effect, the $\eta_{0i}$ are independently
normally distributed with mean 0 and variance $\sigma_B^2$, and the $\eta_{1i}$ are independently normally
distributed with mean 0 and variance $2\sigma_{T \times S}^2$.

The treatment effect size $\delta$ is defined as

$$\delta = \frac{\gamma_{10}}{\sqrt{\sigma_S^2 + \sigma_W^2}}, \quad \text{where } \sigma_S^2 = \sigma_B^2 + \sigma_{T \times S}^2.$$

The ICC is defined in terms of the variances as

$$\rho = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_W^2},$$

and the heterogeneity parameter $\omega$ is defined as

$$\omega = \frac{\sigma_{T \times S}^2}{\sigma_S^2}.$$

The test for the treatment effect in this design is a test of the hypothesis that $\gamma_{10} = 0$.
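
The randomized-block parameters can be computed the same way. The sketch below takes the heterogeneity parameter to be the treatment-by-cluster interaction variance divided by the total between-cluster variance, which is how the glossary describes it ("one half of the variation in the treatment effects across clusters divided by the total variation across clusters"). The function and argument names are assumptions introduced for illustration.

```python
import math

def randomized_block_summary(gamma10, var_b, var_txs, var_w):
    """Summarize a two-level randomized-block design with no covariates.

    gamma10 : mean treatment effect
    var_b   : variance of the cluster means (block variance)
    var_txs : treatment-by-cluster interaction variance
    var_w   : within-cluster variance
    """
    var_s = var_b + var_txs              # total between-cluster variance
    total = var_s + var_w
    return {
        "var_s": var_s,
        "delta": gamma10 / math.sqrt(total),
        "rho": var_s / total,
        "omega": var_txs / var_s,        # heterogeneity parameter
        "var_effect": 2.0 * var_txs,     # variance of cluster-specific treatment effects
    }

# Hypothetical variance components, chosen only for illustration.
summary = randomized_block_summary(gamma10=0.5, var_b=0.15, var_txs=0.05, var_w=0.8)
```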

Two-Level Randomized-Block Design With Covariates

Suppose that there are m clusters of size 2n and that n individuals in each cluster are assigned
to each treatment. Now suppose that there are q covariates at the individual level and s
covariates at the cluster level. Let $Y_{ij}$ be the $j$th observation in the $i$th cluster. Thus, the level 1
(individual-level) model is

$$Y_{ij} = \beta_{0i} + \beta_{1i} T_{ij} + \beta_1 X_{1ij} + \cdots + \beta_q X_{qij} + \varepsilon_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, 2n,$$

where $\beta_{0i}$ is the covariate-adjusted mean of the $i$th cluster, $T_{ij}$ is a treatment indicator coded
$T_{ij} = 1/2$ for individuals assigned to treatment and $T_{ij} = -1/2$ for the individuals assigned to control,
$\beta_1, \ldots, \beta_q$ are the effects of the individual-level covariates (which are fixed effects), and the $\varepsilon_{ij}$
are independently normally distributed with mean 0 and variance $\sigma_{AW}^2$. The level 2 (cluster-level)
model is

$$\beta_{0i} = \gamma_{00} + \gamma_{01} W_{1i} + \cdots + \gamma_{0r} W_{ri} + \eta_{0i}, \quad i = 1, \ldots, m,$$

and

$$\beta_{1i} = \gamma_{10} + \gamma_{11} W_{1i} + \cdots + \gamma_{1r} W_{ri} + \eta_{1i}, \quad i = 1, \ldots, m,$$

where $\gamma_{00}$ is the covariate-adjusted grand mean, $\gamma_{10}$ is the covariate-adjusted mean treatment
effect, $\gamma_{01}, \ldots, \gamma_{0r}$ are the effects of the level 2 covariates on the mean, $\gamma_{11}, \ldots, \gamma_{1r}$ are the
effects of the level 2 covariates on the cluster-specific treatment effects, the $\eta_{0i}$ are
independently normally distributed with mean 0 and variance $\sigma_{AB}^2$, and the $\eta_{1i}$ are
independently normally distributed with mean 0 and variance $2\sigma_{AT \times S}^2$.

The treatment effect size $\delta$ is defined as

$$\delta = \frac{\gamma_{10}}{\sqrt{\sigma_S^2 + \sigma_W^2}}.$$

The ICC is defined in terms of the unadjusted variances as

$$\rho = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_W^2},$$

and the heterogeneity parameter $\omega$ is defined in terms of the unadjusted variances as

$$\omega = \frac{\sigma_{T \times S}^2}{\sigma_S^2}.$$

The covariate-outcome correlations are defined in terms of the adjusted and unadjusted
variances as

$$R_S^2 = 1 - \frac{\sigma_{AT \times S}^2}{\sigma_{T \times S}^2}$$

and

$$R_W^2 = 1 - \frac{\sigma_{AW}^2}{\sigma_W^2}.$$

The test for the treatment effect in this design is a test of the hypothesis that $\gamma_{10} = 0$.



Three-Level Randomized-Block Designs Assigning Subclusters With No 
Covariates 



Suppose that there are m clusters, each with 2 p subclusters of size n, and that half of the 
subclusters in each cluster are assigned to each treatment. Let $Y_{ijk}$ be the $k$th observation in the $j$th
subcluster of the $i$th cluster. Thus, the level 1 (individual-level) model is

$$Y_{ijk} = \beta_{0ij} + \varepsilon_{ijk}, \quad i = 1, \ldots, m; \; j = 1, \ldots, 2p; \; k = 1, \ldots, n,$$

where $\beta_{0ij}$ is the mean of the $j$th subcluster in the $i$th cluster and the $\varepsilon_{ijk}$ are independently
normally distributed with mean 0 and variance $\sigma_W^2$. The level 2 (subcluster-level) model is

$$\beta_{0ij} = \gamma_{00i} + \gamma_{01i} T_{ij} + \eta_{0ij}, \quad j = 1, \ldots, 2p; \; i = 1, \ldots, m,$$

where $\gamma_{00i}$ is the mean of the $i$th cluster, $\gamma_{01i}$ is the treatment effect in the $i$th cluster, $T_{ij}$ is a
treatment indicator coded $T_{ij} = 1/2$ for treatment subclusters and $T_{ij} = -1/2$ for the control
subclusters, and the $\eta_{0ij}$ are independently normally distributed with mean 0 and variance $\sigma_C^2$.

The level 3 (cluster-level) model is

$$\gamma_{00i} = \pi_{00} + \xi_{0i}, \quad i = 1, \ldots, m$$

and

$$\gamma_{01i} = \pi_{10} + \xi_{1i}, \quad i = 1, \ldots, m,$$

where $\pi_{00}$ is the grand mean, $\pi_{10}$ is the average treatment effect, the $\xi_{0i}$ are independently
normally distributed with mean 0 and variance $\sigma_{SB}^2$, and the $\xi_{1i}$ are independently normally
distributed with mean 0 and variance $2\sigma_{T \times S}^2$.

The treatment effect size $\delta$ is defined in terms of the unadjusted variances; that is,

$$\delta = \frac{\pi_{10}}{\sqrt{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}}, \quad \text{where } \sigma_S^2 = \sigma_{SB}^2 + \sigma_{T \times S}^2.$$

The ICCs are defined in terms of the variances as

$$\rho_S = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}$$

and

$$\rho_C = \frac{\sigma_C^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2},$$

and the heterogeneity parameter $\omega_S$ is defined in terms of the variances as

$$\omega_S = \frac{\sigma_{T \times S}^2}{\sigma_S^2}.$$

The test for the treatment effect in this design is a test of the hypothesis that $\pi_{10} = 0$.



Three-Level Randomized-Block Designs Assigning Subclusters With 
Covariates 



Suppose that there are m clusters, each with 2 p subclusters of size n, and that half of the 
subclusters in each cluster are assigned to each treatment. Now suppose that there are also q > 
0 covariates at the cluster level, r > 0 covariates at the subcluster level, and s > 0 covariates at 
the individual level. Let $Y_{ijk}$ be the $k$th observation in the $j$th subcluster of the $i$th cluster. Thus, the
level 1 (individual-level) model is

$$Y_{ijk} = \beta_{0ij} + \beta_1 X_{1ijk} + \cdots + \beta_q X_{qijk} + \varepsilon_{ijk}, \quad i = 1, \ldots, m; \; j = 1, \ldots, 2p; \; k = 1, \ldots, n,$$

where $\beta_{0ij}$ is the covariate-adjusted mean of the $j$th subcluster in the $i$th cluster, $\beta_1, \ldots, \beta_q$ are the
effects of the individual-level covariates (which are fixed effects), $X_{1ijk}, \ldots, X_{qijk}$ are the values
of the individual-level covariates (centered on subcluster means), and the $\varepsilon_{ijk}$ are
independently normally distributed with mean 0 and variance $\sigma_{AW}^2$. The level 2 (subcluster-level)
model is

$$\beta_{0ij} = \gamma_{00i} + \gamma_{01i} T_{ij} + \gamma_2 Z_{1ij} + \cdots + \gamma_{r+1} Z_{rij} + \eta_{0ij}, \quad j = 1, \ldots, 2p; \; i = 1, \ldots, m,$$

where $\gamma_{00i}$ is the covariate-adjusted mean of the $i$th cluster, $\gamma_{01i}$ is the treatment effect in the $i$th
cluster, $T_{ij}$ is a treatment indicator coded $T_{ij} = 1/2$ for treatment subclusters and $T_{ij} = -1/2$ for the
control subclusters, $\gamma_2, \ldots, \gamma_{r+1}$ are the effects of the level 2 covariates (which are fixed
effects), $Z_{1ij}, \ldots, Z_{rij}$ are the values of the subcluster-level covariates (centered on cluster
means), and the $\eta_{0ij}$ are independently normally distributed with mean 0 and variance $\sigma_{AC}^2$.
The level 3 (cluster-level) model is

$$\gamma_{00i} = \pi_{00} + \pi_{01} W_{1i} + \cdots + \pi_{0s} W_{si} + \xi_{0i}, \quad i = 1, \ldots, m,$$

and

$$\gamma_{01i} = \pi_{10} + \pi_{11} W_{1i} + \cdots + \pi_{1s} W_{si} + \xi_{1i}, \quad i = 1, \ldots, m,$$

where $\pi_{00}$ is the covariate-adjusted grand mean, $\pi_{10}$ is the covariate-adjusted average treatment
effect, $\pi_{01}, \ldots, \pi_{0s}$ are the effects of the level 3 covariates on the cluster means, $\pi_{11}, \ldots, \pi_{1s}$ are
the effects of the level 3 covariates on the cluster-specific treatment effects, $W_{1i}, \ldots, W_{si}$ are
the values of the cluster-level covariates, the $\xi_{0i}$ are independently normally distributed with
mean 0 and variance $\sigma_{ASB}^2$, and the $\xi_{1i}$ are independently normally distributed with mean 0
and variance $2\sigma_{AT \times S}^2$.

The treatment effect size $\delta$ is defined in terms of the unadjusted variances; that is,

$$\delta = \frac{\pi_{10}}{\sqrt{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}}.$$

The ICCs are defined in terms of the unadjusted variances as

$$\rho_S = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}$$

and

$$\rho_C = \frac{\sigma_C^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2},$$

and the heterogeneity parameter $\omega_S$ is defined in terms of the unadjusted variances as

$$\omega_S = \frac{\sigma_{T \times S}^2}{\sigma_S^2}.$$

The covariate-outcome correlations are defined in terms of the adjusted and unadjusted
variances as

$$R_S^2 = 1 - \frac{\sigma_{AT \times S}^2}{\sigma_{T \times S}^2}, \qquad R_C^2 = 1 - \frac{\sigma_{AC}^2}{\sigma_C^2},$$

and

$$R_W^2 = 1 - \frac{\sigma_{AW}^2}{\sigma_W^2}.$$

The test for the treatment effect in this design is a test of the hypothesis that $\pi_{10} = 0$.

Three-Level Randomized-Block Designs: Assigning Individuals Within
Subclusters With No Covariates

Suppose that there are m clusters, each with p subclusters of size 2n, and that half of the
individuals in each subcluster are assigned to each treatment. Let $Y_{ijk}$ be the $k$th observation in
the $j$th subcluster of the $i$th cluster. Thus, the level 1 (individual-level) model is

$$Y_{ijk} = \beta_{0ij} + \beta_{1ij} T_{ijk} + \varepsilon_{ijk}, \quad i = 1, \ldots, m; \; j = 1, \ldots, p; \; k = 1, \ldots, 2n,$$

where $\beta_{0ij}$ is the mean of the $j$th subcluster in the $i$th cluster, $\beta_{1ij}$ is the treatment effect in the $j$th
subcluster in the $i$th cluster, $T_{ijk}$ is a treatment indicator coded $T_{ijk} = 1/2$ for treatment individuals
and $T_{ijk} = -1/2$ for the control individuals, and the $\varepsilon_{ijk}$ are independently normally distributed
with mean 0 and variance $\sigma_W^2$. The level 2 (subcluster-level) model is

$$\beta_{0ij} = \gamma_{00i} + \eta_{0ij}, \quad j = 1, \ldots, p; \; i = 1, \ldots, m$$

and

$$\beta_{1ij} = \gamma_{10i} + \eta_{1ij}, \quad j = 1, \ldots, p; \; i = 1, \ldots, m,$$

where $\gamma_{00i}$ is the mean of the $i$th cluster, $\gamma_{10i}$ is the treatment effect in the $i$th cluster, the $\eta_{0ij}$ are
independently normally distributed with mean 0 and variance $\sigma_{CB}^2$, and the $\eta_{1ij}$ are
independently normally distributed with mean 0 and variance $2\sigma_{T \times C}^2$. The level 3 (cluster-level)
model is

$$\gamma_{00i} = \pi_{00} + \xi_{0i}, \quad i = 1, \ldots, m$$

and

$$\gamma_{10i} = \pi_{10} + \xi_{1i}, \quad i = 1, \ldots, m,$$

where $\pi_{00}$ is the grand mean, $\pi_{10}$ is the average treatment effect, the $\xi_{0i}$ are independently
normally distributed with mean 0 and variance $\sigma_{SB}^2$, and the $\xi_{1i}$ are independently normally
distributed with mean 0 and variance $\sigma_{T \times S}^2$.

The treatment effect size $\delta$ is defined in terms of the unadjusted variances; that is,

$$\delta = \frac{\pi_{10}}{\sqrt{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}}, \quad \text{where } \sigma_C^2 = \sigma_{CB}^2 + \sigma_{T \times C}^2 \text{ and } \sigma_S^2 = \sigma_{SB}^2 + \sigma_{T \times S}^2.$$

The ICCs are defined in terms of the unadjusted variances as

$$\rho_S = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}$$

and

$$\rho_C = \frac{\sigma_C^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2},$$

and the heterogeneity parameters $\omega_S$ and $\omega_C$ are defined in terms of the unadjusted variances
as

$$\omega_S = \frac{\sigma_{T \times S}^2}{\sigma_S^2}$$

and

$$\omega_C = \frac{\sigma_{T \times C}^2}{\sigma_C^2}.$$

The test for the treatment effect in this design is a test of the hypothesis that $\pi_{10} = 0$.
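
This design is the only one with heterogeneity at two levels, so it can help to see how the block and interaction components combine. The sketch below is a hedged illustration: the function and argument names are hypothetical, and each heterogeneity parameter is taken to be the treatment-by-unit interaction variance divided by the total variance at that level, following the glossary's description of omega.

```python
import math

def three_level_block_summary(pi10, var_sb, var_txs, var_cb, var_txc, var_w):
    """Summarize a three-level randomized-block design (individuals assigned).

    var_sb, var_txs : cluster-level block and treatment-by-cluster variances
    var_cb, var_txc : subcluster-level block and treatment-by-subcluster variances
    var_w           : individual-level variance
    """
    var_s = var_sb + var_txs   # total cluster-level variance
    var_c = var_cb + var_txc   # total subcluster-level variance
    total = var_s + var_c + var_w
    return {
        "delta": pi10 / math.sqrt(total),
        "rho_s": var_s / total,
        "rho_c": var_c / total,
        "omega_s": var_txs / var_s,
        "omega_c": var_txc / var_c,
    }

# Hypothetical variance components, chosen only for illustration.
summary = three_level_block_summary(pi10=0.4, var_sb=0.08, var_txs=0.02,
                                    var_cb=0.15, var_txc=0.05, var_w=0.7)
```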

Three-Level Randomized-Block Designs: Assigning Individuals Within 
Subclusters With Covariates 

Suppose that there are m clusters, each with p subclusters of size 2 n, and that half of the 
individuals in each subcluster are assigned to each treatment. Now suppose that there are also 
q > 0 covariates at the cluster level, r > 0 covariates at the subcluster level, and s > 0 
covariates at the individual level. Let $Y_{ijk}$ be the $k$th observation in the $j$th subcluster of the $i$th
cluster. Thus, the level 1 (individual-level) model is

$$Y_{ijk} = \beta_{0ij} + \beta_{1ij} T_{ijk} + \beta_2 X_{1ijk} + \cdots + \beta_{q+1} X_{qijk} + \varepsilon_{ijk}, \quad i = 1, \ldots, m; \; j = 1, \ldots, p; \; k = 1, \ldots, 2n,$$

where $\beta_{0ij}$ is the covariate-adjusted mean of the $j$th subcluster in the $i$th cluster, $\beta_{1ij}$ is the
covariate-adjusted treatment effect in the $j$th subcluster in the $i$th cluster, $T_{ijk}$ is a treatment
indicator coded $T_{ijk} = 1/2$ for treatment individuals and $T_{ijk} = -1/2$ for the control individuals,
$\beta_2, \ldots, \beta_{q+1}$ are the effects of the individual-level covariates (which are fixed effects),
$X_{1ijk}, \ldots, X_{qijk}$ are the values of the individual-level covariates (centered on subcluster means), and the
$\varepsilon_{ijk}$ are independently normally distributed with mean 0 and variance $\sigma_{AW}^2$. The level 2
(subcluster-level) model is

$$\beta_{0ij} = \gamma_{00i} + \gamma_{01} Z_{1ij} + \cdots + \gamma_{0r} Z_{rij} + \eta_{0ij}, \quad j = 1, \ldots, p; \; i = 1, \ldots, m,$$

and

$$\beta_{1ij} = \gamma_{10i} + \gamma_{11} Z_{1ij} + \cdots + \gamma_{1r} Z_{rij} + \eta_{1ij}, \quad j = 1, \ldots, p; \; i = 1, \ldots, m,$$

where $\gamma_{00i}$ is the covariate-adjusted mean of the $i$th cluster, $\gamma_{10i}$ is the covariate-adjusted
treatment effect in the $i$th cluster, $\gamma_{01}, \ldots, \gamma_{0r}$ are the effects of the level 2 covariates (which are
fixed effects) on the subcluster means, $\gamma_{11}, \ldots, \gamma_{1r}$ are the effects of the level 2 covariates on
the subcluster-specific treatment effects, $Z_{1ij}, \ldots, Z_{rij}$ are the values of the subcluster-level
covariates (centered on cluster means), the $\eta_{0ij}$ are independently normally distributed with
mean 0 and variance $\sigma_{ACB}^2$, and the $\eta_{1ij}$ are independently normally distributed with mean 0
and variance $\sigma_{AT \times C}^2$. The level 3 (cluster-level) model is

$$\gamma_{00i} = \pi_{00} + \pi_{01} W_{1i} + \cdots + \pi_{0s} W_{si} + \xi_{0i}, \quad i = 1, \ldots, m,$$

and

$$\gamma_{10i} = \pi_{10} + \pi_{11} W_{1i} + \cdots + \pi_{1s} W_{si} + \xi_{1i}, \quad i = 1, \ldots, m,$$

where $\pi_{00}$ is the covariate-adjusted grand mean, $\pi_{10}$ is the average covariate-adjusted treatment
effect, $\pi_{01}, \ldots, \pi_{0s}$ are the effects of the level 3 covariates on the cluster means, $\pi_{11}, \ldots, \pi_{1s}$ are
the effects of the level 3 covariates on the cluster-specific treatment effects, the $\xi_{0i}$ are
independently normally distributed with mean 0 and variance $\sigma_{ASB}^2$, and the $\xi_{1i}$ are
independently normally distributed with mean 0 and variance $\sigma_{AT \times S}^2$.

The treatment effect size $\delta$ is defined in terms of the unadjusted variances; that is,

$$\delta = \frac{\pi_{10}}{\sqrt{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}}.$$

The ICCs are defined in terms of the unadjusted variances as

$$\rho_S = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2}$$

and

$$\rho_C = \frac{\sigma_C^2}{\sigma_S^2 + \sigma_C^2 + \sigma_W^2},$$

and the heterogeneity parameters $\omega_S$ and $\omega_C$ are defined in terms of the unadjusted variances
as

$$\omega_S = \frac{\sigma_{T \times S}^2}{\sigma_S^2}$$

and

$$\omega_C = \frac{\sigma_{T \times C}^2}{\sigma_C^2}.$$

The covariate-outcome correlations are defined in terms of the adjusted and unadjusted
variances as

$$R_S^2 = 1 - \frac{\sigma_{AT \times S}^2}{\sigma_{T \times S}^2}, \qquad R_C^2 = 1 - \frac{\sigma_{AT \times C}^2}{\sigma_{T \times C}^2},$$

and

$$R_W^2 = 1 - \frac{\sigma_{AW}^2}{\sigma_W^2}.$$

The test for the treatment effect in this design is a test of the hypothesis that $\pi_{10} = 0$.

Appendix E: Glossary of Terms

Page # Terms:

20 Cluster-level covariate: A covariate that is measured at the cluster (e.g., school or
classroom) level. Please see also the entry for “covariate.”

4 Clustered sampling: A technique in which a sample of naturally occurring groups 
called clusters (such as schools or residential blocks) is first selected, and then
individuals are sampled within the selected clusters. Clustered sampling differs from 
stratified sampling in that individuals are selected from only some of the groups 
(clusters) in the population, whereas in stratified samples, individuals are intentionally 
selected from within every group (stratum). 

13 Cohen’s d: The standardized effect size computed by subtracting the mean of the 
control group from the mean of the treatment group, and dividing by a within-group 
standard deviation. Note that in multilevel designs different definitions of Cohen’s d 
are possible (see Hedges, 2007). The current paper divides the mean difference by the 
total standard deviation to compute Cohen’s d. Please also see the term “effect size.” 

5 Conditional inference model: The appropriate statistical model to use when 
conclusions from the data gathered are meant to apply only to the particular clusters 
and subclusters (e.g., schools and classrooms) actually studied in the experiment.
Appropriate only when the population of interest is fixed in both time and place. 
Please see also “Unconditional inference model.” 

12 Covariate: A variable that cannot be affected by the treatment and is expected to be 
correlated with the dependent variable. Ideally covariates are measured before the 
treatment is implemented. Covariates can be used to increase power in multilevel 
designs by decreasing residual variation between and within clusters. 

7 Design effect (common usage): The ratio of the variance of an estimator under a 
particular sampling design to the variance of that estimator under a simple random 
sampling design. 

14 Design effect (current paper): The gain in precision from sampling more than one unit 
per cluster or subcluster. 

2 Effect size: A measure of the strength of the relationship between two variables. The 
current paper uses a version of Cohen’s d as the effect size measure for computing 
statistical power. 

N/A Experiment: A research study where study subjects are exposed to conditions
manipulated by the researcher, usually with the goal of determining the causal impact
of some stimuli.




N/A False negative: To fail to identify a treatment effect when one exists or to erroneously
fail to reject the null hypothesis; also known as a Type II error.

N/A False positive: To incorrectly detect a treatment effect when none exists or to
erroneously reject the null hypothesis; also known as a Type I error.

34 Heterogeneity parameter (omega): One half of the variation in the treatment effects 
across clusters divided by the total variation across clusters. It is used in power 
computations for randomized-block designs. 

9 Hierarchical design: An experimental design in which entire clusters (e.g., schools or 
classrooms) are assigned to treatments. Thus, every student in a given cluster (a school 
or classroom) receives the same treatment. 

20 Individual-level covariate: A covariate defined at the individual level of a design. 

1 Intraclass correlation coefficient (ICC): A parameter that measures the extent to which 
members of the same cluster or subcluster are more similar to one another than they 
are to members of other clusters or subclusters. In education research, for instance, an 
ICC may measure the extent to which the students in a particular classroom (or 
school) are more alike than students in another classroom (or school). 

3 Nesting: Refers to the idea that certain units are contained within other units (for 
instance, schools are nested within school districts, classrooms are nested within 
schools, students are nested within classrooms). 

18 Noncentral t-distribution: The sampling distribution of the t statistic when the null 
hypothesis is false. It has two parameters, degrees of freedom and the noncentrality 
parameter. It is used in statistical power calculations. 

14 Noncentrality parameter: A quantity determining the distribution of a test statistic 
when the null hypothesis is false. It is the quantity most directly related to the 
statistical power of a test under a given alternative hypothesis. 

2 Operational effect size: A modification to the usual effect size that can be used to 
compute power in multi-level designs using power tables intended for single level 
designs. The operational effect size is the usual effect size multiplied by a design 
effect that depends on features of the complex experimental design, such as the ICC. 

13 Operational sample size: A modified sample size that can be used to compute power in 
multi-level designs using power tables intended for single level designs. The 
operational sample size is closely related to the number of clusters in the experiment. 

13 Optimal Design: A software program created by Raudenbush, Spybrook, Congdon,
and Liu that is used for computing statistical power in group-randomized designs. 

1 Power analysis: The calculation of the statistical power of a proposed research design
for the purpose of ensuring that the design under consideration has a high enough
chance of rejecting the null hypothesis when a given true effect of treatment is
present.

13 Power table: A table that lists statistical power as a function of sample size and effect 
size. 

10 Quasi-experiment: A research design that compares groups but does not involve
randomization. Rather, the treatment and comparison groups are often matched based
on pretests or demographic factors, such as socioeconomic status.

10 Randomized-block design: A class of experimental designs where units within the
same cluster are randomly assigned to different treatments. For example, if students
within the same school were randomly assigned to either of two treatments (e.g., an
intervention and a control), the design is a two-level randomized-block design.

11 Significance level: The probability of rejecting the null hypothesis when it is true;
that is, the probability of a Type I error. The significance level 0.05 is often used in
statistics.

N/A Statistical inference: The process of deriving some conclusion about a population 
based on a sample. 

1 Statistical power: The probability that the test of the null hypothesis of no average
treatment effect will successfully reject the null hypothesis when a non-zero treatment
effect exists.

4 Stratified sampling: A sampling method in which at least one individual from every 
one of an identified set of subgroups (e.g., schools) in a given population is 
intentionally included in the sample. This differs from clustered sampling, in which 
some subgroups (called clusters, in this case) have no individuals in the sample. 

1 Type I error: Rejecting the null hypothesis when there is no treatment effect. 

1 Type II error: Failing to reject the null hypothesis when a treatment effect is present. 

5 Unconditional inference model: A statistical model appropriate for use when the data
in an experiment will be used to generalize to a population of interest that may not be 
fixed in either time or place. For example, this model is appropriate if the teachers and 
classrooms under study are considered a sample of potential teachers that could have 
been assigned to teach the students under study. 

N/A Unity: The number one. 

7 Variance inflation factor: Please see the term “design effect (common usage).” 
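
As a small worked example of the common-usage design effect (the variance inflation factor), the sketch below uses the standard textbook result that, for equal-sized clusters of n units with intraclass correlation rho, the variance of a mean is inflated by 1 + (n - 1) * rho relative to simple random sampling. That formula is a well-known survey-sampling result rather than something stated in this glossary, and the function names are introduced here for illustration.

```python
def variance_inflation_factor(cluster_size, icc):
    """Common-usage design effect for equal-sized clusters: 1 + (n - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(total_n, cluster_size, icc):
    """Size of the simple random sample that would give the same precision."""
    return total_n / variance_inflation_factor(cluster_size, icc)

# Hypothetical design: 20 clusters of 20 students with an ICC of 0.2.
vif = variance_inflation_factor(cluster_size=20, icc=0.2)
n_eff = effective_sample_size(total_n=400, cluster_size=20, icc=0.2)
```

Even a modest ICC shrinks the effective sample size considerably when clusters are large, which is why clustering must be accounted for in power calculations.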

Table 1: Power of the Test for Treatment Effects in the Hierarchical Design as a Function of Operational Sample Size N^T
and Operational Effect Size Δ^T

Effect size Δ^T

N^T   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   1.1   1.2   1.3   1.4   1.5   1.6   1.7   1.8   1.9   2.0
--    --    --    --    --    --    0.72  0.84  0.92  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
76    0.07  0.14  0.25  0.41  0.58  0.73  0.85  0.93  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
78    0.07  0.14  0.26  0.41  0.59  0.74  0.86  0.94  0.98  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
80    0.07  0.14  0.26  0.42  0.60  0.75  0.87  0.94  0.98  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
82    0.07  0.15  0.27  0.43  0.61  0.77  0.88  0.95  0.98  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
84    0.07  0.15  0.27  0.44  0.62  0.78  0.89  0.95  0.98  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
86    0.07  0.15  0.28  0.45  0.63  0.79  0.89  0.96  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
88    0.07  0.15  0.29  0.46  0.64  0.79  0.90  0.96  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
90    0.08  0.16  0.29  0.47  0.65  0.80  0.91  0.96  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

Note: Only part of the table survives here. The first row is incomplete: its label and first five entries are
missing, and its 15 surviving entries are placed in the last 15 columns.

Table 2: Power of the Test for Treatment Effects in the Randomized-Block Design as a Function of Operational Sample
Size N and Operational Effect Size Δ

Effect size Δ

N     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   1.1   1.2   1.3   1.4   1.5   1.6   1.7   1.8   1.9   2.0
2     0.05  0.05  0.05  0.06  0.06  0.07  0.07  0.08  0.09  0.09  0.10  0.11  0.12  0.12  0.13  0.14  0.15  0.16  0.17  0.18
3     0.05  0.06  0.06  0.07  0.08  0.10  0.12  0.13  0.16  0.18  0.20  0.23  0.26  0.29  0.32  0.35  0.38  0.41  0.44  0.47
4     0.05  0.06  0.07  0.09  0.11  0.14  0.17  0.21  0.25  0.29  0.33  0.38  0.43  0.48  0.53  0.58  0.63  0.67  0.72  0.75
5     0.05  0.06  0.08  0.11  0.14  0.18  0.23  0.28  0.34  0.40  0.47  0.53  0.59  0.65  0.71  0.76  0.81  0.85  0.88  0.91
6     0.05  0.07  0.09  0.13  0.17  0.22  0.29  0.36  0.43  0.51  0.58  0.66  0.72  0.78  0.83  0.88  0.91  0.94  0.96  0.97
7     0.06  0.07  0.10  0.15  0.20  0.27  0.34  0.43  0.52  0.60  0.68  0.75  0.82  0.87  0.91  0.94  0.96  0.97  0.98  0.99
8     0.06  0.08  0.11  0.17  0.23  0.31  0.40  0.50  0.59  0.68  0.76  0.83  0.88  0.92  0.95  0.97  0.98  0.99  1.00  1.00
9     0.06  0.08  0.13  0.19  0.26  0.35  0.46  0.56  0.66  0.75  0.82  0.88  0.93  0.96  0.97  0.99  0.99  1.00  1.00  1.00
10    0.06  0.09  0.14  0.21  0.29  0.40  0.51  0.62  0.72  0.80  0.87  0.92  0.95  0.97  0.99  0.99  1.00  1.00  1.00  1.00
11    0.06  0.09  0.15  0.22  0.32  0.44  0.55  0.67  0.77  0.85  0.91  0.95  0.97  0.99  0.99  1.00  1.00  1.00  1.00  1.00
12    0.06  0.10  0.16  0.24  0.35  0.47  0.60  0.71  0.81  0.88  0.93  0.97  0.98  0.99  1.00  1.00  1.00  1.00  1.00  1.00
13    0.06  0.10  0.17  0.26  0.38  0.51  0.64  0.75  0.85  0.91  0.95  0.98  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00
14    0.06  0.11  0.18  0.28  0.41  0.55  0.68  0.79  0.88  0.93  0.97  0.99  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00
15    0.07  0.11  0.19  0.30  0.44  0.58  0.71  0.82  0.90  0.95  0.98  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
16    0.07  0.12  0.20  0.32  0.46  0.61  0.74  0.85  0.92  0.96  0.98  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
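
A power table of this kind can be read programmatically once the operational sample size and operational effect size are known. The sketch below embeds two rows of Table 2 (N = 10 and N = 16) and interpolates linearly between tabulated effect sizes. Linear interpolation and the names `POWER_TABLE` and `lookup_power` are assumptions of this illustration, not part of the paper's procedure.

```python
import math

# Tabulated operational effect sizes: 0.1, 0.2, ..., 2.0
EFFECT_SIZES = [round(0.1 * k, 1) for k in range(1, 21)]

# Two rows copied from Table 2 (randomized-block design), keyed by N.
POWER_TABLE = {
    10: [0.06, 0.09, 0.14, 0.21, 0.29, 0.40, 0.51, 0.62, 0.72, 0.80,
         0.87, 0.92, 0.95, 0.97, 0.99, 0.99, 1.00, 1.00, 1.00, 1.00],
    16: [0.07, 0.12, 0.20, 0.32, 0.46, 0.61, 0.74, 0.85, 0.92, 0.96,
         0.98, 0.99, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
}

def lookup_power(n, delta, table=POWER_TABLE):
    """Read power for sample size n, interpolating linearly in the effect size."""
    row = table[n]  # raises KeyError if n is not tabulated
    if not EFFECT_SIZES[0] <= delta <= EFFECT_SIZES[-1]:
        raise ValueError("effect size outside the tabulated range 0.1 to 2.0")
    pos = (delta - EFFECT_SIZES[0]) / 0.1            # fractional column index
    lo = min(int(math.floor(pos + 1e-9)), len(row) - 2)
    frac = min(max(pos - lo, 0.0), 1.0)              # guard against rounding noise
    return row[lo] + frac * (row[lo + 1] - row[lo])

power_exact = lookup_power(16, 0.5)    # a tabulated entry
power_interp = lookup_power(10, 0.45)  # halfway between the 0.4 and 0.5 columns
```

Interpolated values are only approximations to the tabulated two-decimal entries and should be treated as rough guides.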
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Effect size A T 



/' 


0.1 


0.2 


0.3 


0.4 


0.5 


0.6 


0.7 


0.8 


0.9 


1.0 


1.1 


1.2 


1.3 


1.4 


1.5 


1.6 


1.7 


1.8 


1.9 


2.0 


17 


0.07 


0.12 


0.21 


0.34 


0.49 


0.64 


0.77 


0.87 


0.94 


0.97 


0.99 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


18 


0.07 


0.13 


0.22 


0.36 


0.52 


0.67 


0.80 


0.89 


0.95 


0.98 


0.99 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


19 


0.07 


0.13 


0.24 


0.38 


0.54 


0.70 


0.82 


0.91 


0.96 


0.98 


0.99 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


 20  0.07  0.14  0.25  0.40  0.56  0.72  0.84  0.92  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 21  0.07  0.14  0.26  0.42  0.59  0.74  0.86  0.94  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 22  0.07  0.15  0.27  0.43  0.61  0.77  0.88  0.95  0.98  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 23  0.07  0.15  0.28  0.45  0.63  0.79  0.89  0.96  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 24  0.08  0.16  0.29  0.47  0.65  0.80  0.91  0.96  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 25  0.08  0.16  0.30  0.48  0.67  0.82  0.92  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 26  0.08  0.17  0.31  0.50  0.69  0.84  0.93  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 27  0.08  0.17  0.32  0.52  0.71  0.85  0.94  0.98  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 28  0.08  0.18  0.33  0.53  0.72  0.86  0.95  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 29  0.08  0.18  0.34  0.55  0.74  0.88  0.95  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 30  0.08  0.19  0.36  0.56  0.75  0.89  0.96  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 31  0.08  0.19  0.37  0.58  0.77  0.90  0.96  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 32  0.09  0.20  0.38  0.59  0.78  0.91  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 33  0.09  0.20  0.39  0.61  0.80  0.92  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

Effect size Δ^T
n^T   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   1.1   1.2   1.3   1.4   1.5   1.6   1.7   1.8   1.9   2.0
 34  0.09  0.20  0.40  0.62  0.81  0.92  0.98  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 35  0.09  0.21  0.41  0.63  0.82  0.93  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 36  0.09  0.21  0.42  0.65  0.83  0.94  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 37  0.09  0.22  0.43  0.66  0.84  0.94  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 38  0.09  0.22  0.44  0.67  0.85  0.95  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 39  0.09  0.23  0.45  0.68  0.86  0.95  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 40  0.09  0.23  0.46  0.69  0.87  0.96  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 41  0.10  0.24  0.47  0.71  0.88  0.96  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 42  0.10  0.24  0.48  0.72  0.89  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 43  0.10  0.25  0.48  0.73  0.89  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 44  0.10  0.25  0.49  0.74  0.90  0.97  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 45  0.10  0.26  0.50  0.75  0.91  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 46  0.10  0.26  0.51  0.76  0.91  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 47  0.10  0.27  0.52  0.77  0.92  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 48  0.10  0.27  0.53  0.77  0.92  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 49  0.11  0.28  0.54  0.78  0.93  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 50  0.11  0.28  0.55  0.79  0.93  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

Effect size Δ^T
n^T   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   1.1   1.2   1.3   1.4   1.5   1.6   1.7   1.8   1.9   2.0
 51  0.11  0.29  0.56  0.80  0.94  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 52  0.11  0.29  0.56  0.81  0.94  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 53  0.11  0.30  0.57  0.82  0.95  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 54  0.11  0.30  0.58  0.82  0.95  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 55  0.11  0.31  0.59  0.83  0.95  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 56  0.11  0.31  0.60  0.84  0.96  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 57  0.12  0.32  0.60  0.84  0.96  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 58  0.12  0.32  0.61  0.85  0.96  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 59  0.12  0.33  0.62  0.86  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 60  0.12  0.33  0.63  0.86  0.97  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 61  0.12  0.34  0.64  0.87  0.97  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 62  0.12  0.34  0.64  0.87  0.97  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 63  0.12  0.35  0.65  0.88  0.97  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 64  0.12  0.35  0.66  0.88  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 65  0.12  0.36  0.66  0.89  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 66  0.13  0.36  0.67  0.89  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 67  0.13  0.36  0.68  0.90  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

Effect size Δ^T
n^T   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   1.1   1.2   1.3   1.4   1.5   1.6   1.7   1.8   1.9   2.0
 68  0.13  0.37  0.68  0.90  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 69  0.13  0.37  0.69  0.91  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
 70  0.13  0.38  0.70  0.91  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00